Load Models

ViModel is the entry point for loading and running vision-language models (VLMs) in Datature Vi. It downloads models from Datature Vi or HuggingFace, caches them locally, and handles device placement automatically.

Before You Start

Install the SDK with the inference extras (vi-sdk[inference]). To load models hosted on Datature Vi, you also need your secret key and organization ID.

Basic model loading

From Datature Vi

from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    secret_key="your-secret-key",
    organization_id="your-organization-id"
)

Store credentials as environment variables instead of hardcoding them:

export DATATURE_VI_SECRET_KEY="your-secret-key"
export DATATURE_VI_ORGANIZATION_ID="your-organization-id"

# Reads credentials from DATATURE_VI_SECRET_KEY and DATATURE_VI_ORGANIZATION_ID
model = ViModel(run_id="your-run-id")

From HuggingFace

model = ViModel(
    pretrained_model_name_or_path="Qwen/Qwen2.5-VL-7B-Instruct"
)

From a local path

model = ViModel(
    pretrained_model_name_or_path="./path/to/local/model"
)

Parameters

| Name | Type | Description | Required | Default |
|------|------|-------------|----------|---------|
| run_id | string | Run ID of a trained Datature Vi model. Required when loading from Datature Vi. | Optional | — |
| secret_key | string | Datature Vi API secret key. Falls back to the DATATURE_VI_SECRET_KEY environment variable. | Optional | — |
| organization_id | string | Datature Vi organization ID. Falls back to the DATATURE_VI_ORGANIZATION_ID environment variable. | Optional | — |
| pretrained_model_name_or_path | string | HuggingFace model name (e.g. 'Qwen/Qwen2.5-VL-7B-Instruct') or local directory path. | Optional | — |
| device_map | string | Device placement strategy. Use 'auto' for automatic GPU detection, 'cuda' for a specific GPU, 'cpu' to force CPU. | Optional | "auto" |
| load_in_8bit | boolean | Load the model in 8-bit quantization. Cuts memory ~50% with minimal accuracy loss. | Optional | false |
| load_in_4bit | boolean | Load the model in 4-bit quantization. Cuts memory ~75% with a slight accuracy tradeoff. | Optional | false |
| dtype | string | Model weight precision: 'float32', 'float16', or 'bfloat16'. Use 'float16' on modern GPUs. | Optional | "float32" |
| attn_implementation | string | Attention implementation. Set to 'flash_attention_2' for faster inference on Ampere+ GPUs (requires the flash-attn package). | Optional | — |
| low_cpu_mem_usage | boolean | Reduce CPU memory during model loading. Recommended when loading large models. | Optional | false |
| save_path | string | Custom directory for caching downloaded models. | Optional | ~/.datature/vi/models/ |
| overwrite | boolean | Force a fresh download even if the model is already cached. | Optional | false |

Device management

Automatic device selection

By default, ViModel detects available GPUs and places the model there. Pass device_map="auto" explicitly or omit it; both behave the same way.

model = ViModel(
    run_id="your-run-id",
    device_map="auto"  # default
)

Manual device selection

import torch

if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():  # Apple Silicon
    device = "mps"
else:
    device = "cpu"

model = ViModel(run_id="your-run-id", device_map=device)

Multi-GPU distribution

device_map="auto" distributes a large model across all available GPUs automatically.

model = ViModel(
    run_id="your-run-id",
    device_map="auto",
    dtype="float16"
)

Memory optimization

8-bit quantization

Reduces memory by ~50%. The best starting point for GPUs with limited VRAM.

model = ViModel(
    run_id="your-run-id",
    load_in_8bit=True,
    device_map="auto"
)

4-bit quantization

Reduces memory by ~75% with a slight accuracy tradeoff, letting you run larger models on smaller GPUs.

model = ViModel(
    run_id="your-run-id",
    load_in_4bit=True,
    device_map="auto"
)

Quantization Notes
  • You cannot set both load_in_8bit and load_in_4bit at the same time.
  • Quantization requires the bitsandbytes library, which is included in vi-sdk[inference].
  • A few model layers may not support quantization and will stay in their original precision.

Mixed precision

float16 is faster and uses less memory than float32 on modern GPUs with negligible quality impact.

model = ViModel(
    run_id="your-run-id",
    dtype="float16",
    device_map="auto"
)

Supported dtypes: "float32" (default), "float16" (recommended on GPUs), "bfloat16" (better numeric range than float16)
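
If you want to pick the precision at runtime rather than hardcode it, you can query the GPU first. A minimal sketch using PyTorch's torch.cuda.is_bf16_supported(); the selection logic is an illustration, not part of the SDK:

import torch

# Prefer bfloat16 where the hardware supports it, float16 on older GPUs,
# and fall back to full precision on CPU.
if not torch.cuda.is_available():
    dtype = "float32"
elif torch.cuda.is_bf16_supported():
    dtype = "bfloat16"
else:
    dtype = "float16"

model = ViModel(run_id="your-run-id", dtype=dtype, device_map="auto")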

Low CPU memory usage

Reduces peak CPU memory during the loading phase. Worth enabling when loading large models.

model = ViModel(
    run_id="your-run-id",
    low_cpu_mem_usage=True,
    device_map="auto"
)

Advanced loading options

Flash Attention 2

Flash Attention 2 speeds up inference on Ampere-or-newer GPUs.

model = ViModel(
    run_id="your-run-id",
    attn_implementation="flash_attention_2",
    dtype="float16"
)

Flash Attention 2 requires the flash-attn package. It only works on compatible GPUs (Ampere or newer) and falls back to standard attention if unavailable.
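
Since ViModel falls back to standard attention on its own, guarding the option is not required. Still, if you prefer to opt in only when the package is actually installed, here is a minimal sketch using importlib (the kwargs pattern is illustrative, not an SDK requirement):

import importlib.util

# Request Flash Attention 2 only if the flash-attn package is importable.
kwargs = {}
if importlib.util.find_spec("flash_attn") is not None:
    kwargs["attn_implementation"] = "flash_attention_2"

model = ViModel(run_id="your-run-id", dtype="float16", **kwargs)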

Full configuration example

model = ViModel(
    run_id="your-run-id",
    device_map="auto",
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",
    dtype="float16"
)

Model caching

Downloaded models are cached locally at ~/.datature/vi/models/ by default. The directory layout looks like this:

~/.datature/vi/models/
└── <run-id>/
    ├── model_full/        # Full model weights
    ├── adapter/           # Adapter weights (if available)
    └── run_config.json    # Training configuration

Re-initializing with the same run_id uses the cached copy and skips the download.
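
To see what is already cached, you can walk the default cache directory yourself. A small sketch assuming the layout above (adjust cache_root if you pass save_path):

from pathlib import Path

cache_root = Path.home() / ".datature" / "vi" / "models"

if cache_root.exists():
    for run_dir in sorted(cache_root.iterdir()):
        if run_dir.is_dir():
            # Sum the sizes of all files under this run's cache entry.
            size_gb = sum(f.stat().st_size for f in run_dir.rglob("*") if f.is_file()) / 1e9
            print(f"{run_dir.name}: {size_gb:.2f} GB")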

Custom save path

model = ViModel(
    run_id="your-run-id",
    save_path="./my_models"
)

Force re-download

model = ViModel(
    run_id="your-run-id",
    overwrite=True
)

Use overwrite=True when the model was updated after your initial download or if you suspect cache corruption.

Inspecting models

Check model metadata before loading the weights. This is fast: it only reads configuration files and requires no GPU memory.

from vi.inference import ViModel

info = ViModel.inspect(
    secret_key="your-key",
    organization_id="your-org",
    run_id="your-run-id"
)

print(f"Model: {info.model_name}")
print(f"Size: {info.size_gb:.2f} GB")
print(f"Architecture: {info.architecture}")
print(f"Task: {info.task_type}")

# Decide loading strategy based on size
if info.size_gb > 10:
    model = ViModel(run_id="your-run-id", load_in_8bit=True)
else:
    model = ViModel(run_id="your-run-id")

You can also inspect a HuggingFace or local model by passing pretrained_model_name_or_path:

info = ViModel.inspect(
    pretrained_model_name_or_path="./path/to/model"
)
print(info)

Supported model architectures

| Model | Architecture class | Task support |
|-------|--------------------|--------------|
| Qwen2.5-VL | Qwen2_5_VLForConditionalGeneration | VQA, Phrase Grounding, multi-turn |
| InternVL 3.5 | InternVLForConditionalGeneration | VQA, Phrase Grounding, long context |
| Cosmos Reason1 | Qwen2_5_VLForConditionalGeneration | VQA, Phrase Grounding, advanced reasoning |
| NVILA | NVILA | VQA, Phrase Grounding, NVIDIA GPU optimized |

DeepSeek OCR and LLaVA-NeXT are coming in future releases. Check the changelog for updates.

Loading best practices

Reuse model instances. Loading a model is expensive. Create it once and call it many times. Don't recreate it inside a loop.

# Good: create once, reuse
model = ViModel(run_id="your-run-id")
for image in images:
    result, error = model(source=image)

# Bad: recreating for each image wastes time and memory
for image in images:
    model = ViModel(run_id="your-run-id")
    result, error = model(source=image)

Choose quantization based on your hardware. 8-bit is the right default for most GPU setups. Switch to 4-bit if you're hitting memory limits. Skip quantization only when memory comfortably allows it, since full precision gives the best accuracy.
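
One way to automate that choice is to branch on total VRAM. A hedged sketch; the thresholds below are rough illustrative guesses, not official recommendations:

import torch

def quantization_kwargs() -> dict:
    """Pick quantization flags from total VRAM (thresholds are guesses)."""
    if not torch.cuda.is_available():
        return {}
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if vram_gb >= 40:
        return {}                      # plenty of memory: full precision
    if vram_gb >= 16:
        return {"load_in_8bit": True}  # ~50% memory reduction
    return {"load_in_4bit": True}      # ~75% memory reduction

model = ViModel(run_id="your-run-id", device_map="auto", **quantization_kwargs())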

Handle loading errors. Wrap the constructor in a try/except when credentials or network access might fail.

try:
    model = ViModel(
        run_id="your-run-id",
        secret_key="your-secret-key",
        organization_id="your-organization-id"
    )
except ValueError as e:
    print(f"Failed to load model: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Loading scenarios

Memory-constrained GPU: 8-bit quantization with reduced CPU memory during loading.

model = ViModel(
    run_id="your-run-id",
    load_in_8bit=True,
    low_cpu_mem_usage=True
)

Production deployment: credentials from environment variables, half precision, and Flash Attention 2.

import os

model = ViModel(
    secret_key=os.getenv("VI_SECRET_KEY"),
    organization_id=os.getenv("VI_ORG_ID"),
    run_id="your-run-id",
    device_map="auto",
    dtype="float16",
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True
)

Severely memory-constrained hardware: 4-bit quantization.

model = ViModel(
    run_id="your-run-id",
    load_in_4bit=True,
    low_cpu_mem_usage=True,
    device_map="auto"
)

Related resources

Run Inference

Single-image and batch inference, streaming, error handling, and common workflows.

Improve Performance

GPU utilization, batching strategies, and hardware recommendations.

Troubleshoot Issues

Fix OOM errors, slow loading, failed downloads, and other common problems.