Load Models

Load trained vision-language models from Datature Vi or HuggingFace for inference.

Overview

The Vi SDK provides the ViModel class as the recommended way to load models and run inference. It handles:

  • Automatic downloading - Downloads models from Datature Vi
  • Caching - Caches models locally to avoid re-downloading
  • Device management - Automatically detects and uses available GPUs
  • Memory optimization - Supports quantization and efficient loading
  • Error handling - Graceful error handling and validation
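Putting these pieces together, here is a minimal sketch of the typical workflow. It assumes your credentials are already set as environment variables (described below) and that source accepts an image file path:

from vi.inference import ViModel

# Load the model once; credentials are read from environment variables
model = ViModel(run_id="your-run-id")

# Run inference on a single image, using the (result, error) return
# style shown in the examples later on this page
result, error = model(source="path/to/image.jpg")
if error is None:
    print(result)
else:
    print(f"Inference failed: {error}")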

Basic Model Loading

From Datature Vi

Load a model trained on Datature Vi using your credentials and run ID:

from vi.inference import ViModel

# Load model with credentials and run ID
model = ViModel(
    run_id="your-run-id",
    secret_key="your-secret-key",
    organization_id="your-organization-id"
)
💡 Using Environment Variables

For better security, set credentials as environment variables:

export DATATURE_VI_SECRET_KEY="your-secret-key"
export DATATURE_VI_ORGANIZATION_ID="your-organization-id"

Then initialize without explicit credentials:

model = ViModel(run_id="your-run-id")

From HuggingFace

Load models directly from the HuggingFace model hub:

# Load from HuggingFace model hub
model = ViModel(
    pretrained_model_name_or_path="Qwen/Qwen2.5-VL-7B-Instruct"
)

From Local Path

Load models from a local directory:

# Load from local path
model = ViModel(
    pretrained_model_name_or_path="./path/to/local/model"
)

Device Management

Automatic Device Selection

By default, ViModel automatically detects and uses available GPUs:

# Automatically use GPU if available
model = ViModel(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id",
    device_map="auto"  # Default
)

Manual Device Selection

Specify the device explicitly:

import torch

# Check available devices
if torch.cuda.is_available():
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
    device = "cuda"
elif torch.backends.mps.is_available():  # Apple Silicon
    print("Apple Silicon GPU available")
    device = "mps"
else:
    print("Using CPU")
    device = "cpu"

# Load on specific device
model = ViModel(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id",
    device_map=device
)

Multi-GPU Distribution

Automatically distribute the model across multiple GPUs:

# Distribute across all available GPUs
model = ViModel(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id",
    device_map="auto"  # Automatically distributes
)

Memory Optimization

Quantization

Reduce memory usage with quantization:

8-bit Quantization

Reduces model memory by approximately 50% compared with 16-bit weights:

model = ViModel(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id",
    load_in_8bit=True,
    device_map="auto"
)

Benefits:

  • ~50% memory reduction
  • Minimal accuracy loss
  • Faster inference

4-bit Quantization

Reduces model memory by approximately 75% compared with 16-bit weights:

model = ViModel(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id",
    load_in_4bit=True,
    device_map="auto"
)

Benefits:

  • ~75% memory reduction
  • Slight accuracy tradeoff
  • Significantly faster inference
  • Enables loading larger models
⚠️ Quantization Notes

  • Cannot use both load_in_8bit and load_in_4bit simultaneously
  • Quantization requires the bitsandbytes library (included with vi-sdk[inference])
  • Some layers may not support quantization and will remain in original precision
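If you want to choose a quantization mode programmatically, one approach is to branch on free GPU memory so that at most one of the two flags is ever set. The thresholds below are illustrative assumptions for a mid-sized vision-language model, not values defined by the SDK:

import torch

from vi.inference import ViModel

def load_with_quantization(run_id: str) -> ViModel:
    """Pick at most one quantization flag based on free GPU memory."""
    kwargs = {"device_map": "auto"}
    if torch.cuda.is_available():
        free_bytes, _ = torch.cuda.mem_get_info()
        free_gb = free_bytes / 1e9
        if free_gb < 8:
            kwargs["load_in_4bit"] = True   # tightest memory budget
        elif free_gb < 16:
            kwargs["load_in_8bit"] = True   # moderate memory budget
        # Plenty of memory: load without quantization for best quality
    return ViModel(run_id=run_id, **kwargs)

model = load_with_quantization("your-run-id")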

Mixed Precision

Use FP16 for faster inference with minimal quality loss:

model = ViModel(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id",
    dtype="float16",  # Half precision
    device_map="auto"
)

Supported dtypes:

  • "float32" - Full precision (default)
  • "float16" - Half precision (recommended for GPUs)
  • "bfloat16" - Brain float 16 (better range than float16)

Low CPU Memory Usage

Optimize peak CPU memory usage during loading. With low_cpu_mem_usage=True, weights are loaded incrementally instead of holding an extra full copy of the model in CPU RAM:

model = ViModel(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id",
    low_cpu_mem_usage=True,
    device_map="auto"
)

Advanced Loading Options

Flash Attention 2

Enable Flash Attention 2 for faster inference:

model = ViModel(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id",
    attn_implementation="flash_attention_2"
)
ℹ️ Flash Attention 2 Requirements

  • Requires the flash-attn package to be installed
  • Only supported on compatible GPUs (Ampere or newer)
  • Provides significant speedup for long sequences
  • Falls back to standard attention if not available
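Although the loader falls back to standard attention on its own, you can also gate the option explicitly before loading. The check below is a sketch; it assumes the flash-attn package name and treats compute capability 8.0 (Ampere) as the cutoff:

import importlib.util

import torch

from vi.inference import ViModel

# Only request Flash Attention 2 when the package is installed and the
# GPU is Ampere (compute capability 8.0) or newer
has_flash_attn = importlib.util.find_spec("flash_attn") is not None
has_ampere_gpu = (
    torch.cuda.is_available()
    and torch.cuda.get_device_capability(0)[0] >= 8
)

kwargs = {}
if has_flash_attn and has_ampere_gpu:
    kwargs["attn_implementation"] = "flash_attention_2"

model = ViModel(run_id="your-run-id", **kwargs)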

Complete Configuration Example

Combine multiple optimization options:

model = ViModel(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id",
    # Device management
    device_map="auto",
    # Memory optimization
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    # Performance optimization
    attn_implementation="flash_attention_2",
    dtype="float16",
)

Model Caching

How Caching Works

Models are automatically cached locally after the first download:

# First time - downloads model
model1 = ViModel(run_id="your-run-id")  # Downloads

# Second time - uses cached model
model2 = ViModel(run_id="your-run-id")  # Uses cache (faster)

Default cache location:

~/.datature/vi/models/
└── <run-id>/
    ├── model_full/          # Full model weights
    ├── adapter/             # Adapter weights (if available)
    └── run_config.json      # Training configuration
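To see whether a run is already cached, and how much disk space it uses, you can inspect the cache directory directly. This sketch assumes the default location shown above:

from pathlib import Path

run_id = "your-run-id"
cache_dir = Path.home() / ".datature" / "vi" / "models" / run_id

if cache_dir.exists():
    size_bytes = sum(
        f.stat().st_size for f in cache_dir.rglob("*") if f.is_file()
    )
    print(f"Cached at {cache_dir} ({size_bytes / 1e9:.2f} GB)")
else:
    print("Not cached yet - the first load will download the model")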

Custom Save Path

Specify a custom directory for cached models:

model = ViModel(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id",
    save_path="./my_models"  # Custom cache directory
)

Force Re-download

Force a fresh download even if the model is cached:

model = ViModel(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id",
    overwrite=True  # Force re-download
)

When to use:

  • Model was updated after initial download
  • Cache corruption suspected
  • Testing model changes

Inspecting Models

Inspect Before Loading

Check model metadata without loading the full model:

from vi.inference import ViModel

# Inspect model metadata
info = ViModel.inspect(
    secret_key="your-key",
    organization_id="your-org",
    run_id="your-run-id"
)

print(f"Model: {info.model_name}")
print(f"Size: {info.size_gb:.2f} GB")
print(f"Architecture: {info.architecture}")
print(f"Task: {info.task_type}")

# Decide whether to load based on requirements
if info.size_gb < 10 and info.task_type == "VQA":
    # Load the model
    model = ViModel(
        secret_key="your-key",
        organization_id="your-org",
        run_id="your-run-id"
    )
else:
    print("Model doesn't meet requirements")

Benefits:

  • Fast - only reads configuration files
  • No GPU memory required
  • Helps plan resource requirements
  • Useful for automated workflows

Inspect Local Model

Inspect a model from local path:

# Inspect local or HuggingFace model
info = ViModel.inspect(
    pretrained_model_name_or_path="./path/to/model"
)

print(info)

Supported Model Architectures

The Vi SDK currently supports the following model architectures:

Qwen2.5-VL

Model class: Qwen2_5_VLForConditionalGeneration

model = ViModel(run_id="your-qwen-run-id")

Features:

  • Visual Question Answering (VQA)
  • Phrase Grounding
  • Multi-turn conversations
  • High-resolution image support

InternVL 3.5

Model class: InternVLForConditionalGeneration

model = ViModel(run_id="your-internvl-run-id")

Features:

  • Visual Question Answering (VQA)
  • Phrase Grounding
  • Multi-modal understanding
  • Long context support

Cosmos Reason1

Model class: Qwen2_5_VLForConditionalGeneration

model = ViModel(run_id="your-cosmos-run-id")

Features:

  • Visual Question Answering (VQA)
  • Phrase Grounding
  • Advanced reasoning
  • Complex visual understanding

NVILA

Architecture: NVILA

model = ViModel(run_id="your-nvila-run-id")

Features:

  • Visual Question Answering (VQA)
  • Phrase Grounding
  • Optimized for NVIDIA GPUs
  • Efficient inference
ℹ️ Coming Soon

The following model architectures will be supported in upcoming releases:

  • DeepSeek OCR — Specialized OCR model for document understanding
  • LLaVA-NeXT — Advanced multimodal reasoning with improved visual comprehension

Check the changelog for updates.


Loading Best Practices

1. Reuse Model Instances

Create the model once and reuse it for multiple inferences:

# ✅ Good - create once, reuse
model = ViModel(run_id="your-run-id")

for image in images:
    result, error = model(source=image)

# ❌ Bad - recreate for each image (wasteful)
for image in images:
    model = ViModel(run_id="your-run-id")  # Don't do this
    result, error = model(source=image)

2. Choose Appropriate Quantization

Balance memory usage and quality:

# For GPUs with limited memory
model = ViModel(run_id="your-run-id", load_in_8bit=True)

# For very limited memory or large models
model = ViModel(run_id="your-run-id", load_in_4bit=True)

# For best quality (if memory allows)
model = ViModel(run_id="your-run-id")  # No quantization

3. Use Appropriate Precision

Choose dtype based on your hardware:

# For modern GPUs (Ampere or newer)
model = ViModel(run_id="your-run-id", dtype="float16")

# For CPUs or older GPUs
model = ViModel(run_id="your-run-id", dtype="float32")

# For TPUs or specific architectures
model = ViModel(run_id="your-run-id", dtype="bfloat16")

4. Monitor Memory Usage

Track memory consumption:

import torch

# Load model
model = ViModel(run_id="your-run-id")

# Check memory usage
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"GPU Memory - Allocated: {allocated:.2f} GB, Reserved: {reserved:.2f} GB")

5. Handle Loading Errors

Implement proper error handling:

try:
    model = ViModel(
        secret_key="your-secret-key",
        organization_id="your-organization-id",
        run_id="your-run-id"
    )
    print("Model loaded successfully")
except ValueError as e:
    print(f"Failed to load model: {e}")
    # Handle error appropriately
except Exception as e:
    print(f"Unexpected error: {e}")

Common Loading Scenarios

Development Environment

Optimize for fast iteration:

# Fast loading with caching
model = ViModel(
    run_id="your-run-id",
    load_in_8bit=True,  # Reduce memory
    low_cpu_mem_usage=True
)

Production Environment

Optimize for performance and reliability:

import os

# Production-optimized loading with credentials read from the
# environment variables described earlier
model = ViModel(
    secret_key=os.getenv("DATATURE_VI_SECRET_KEY"),
    organization_id=os.getenv("DATATURE_VI_ORGANIZATION_ID"),
    run_id="your-run-id",
    device_map="auto",
    dtype="float16",
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True
)

Resource-Constrained Environment

Maximize memory efficiency:

# Maximum memory efficiency
model = ViModel(
    run_id="your-run-id",
    load_in_4bit=True,  # Maximum compression
    low_cpu_mem_usage=True,
    device_map="auto"
)

Multi-GPU Server

Distribute across GPUs:

# Automatic distribution across GPUs
model = ViModel(
    run_id="your-run-id",
    device_map="auto",  # Auto-distributes
    dtype="float16",
    attn_implementation="flash_attention_2"
)

Troubleshooting

Out of Memory During Loading

If you encounter OOM errors:

# Try 8-bit quantization first
model = ViModel(run_id="your-run-id", load_in_8bit=True)

# If still OOM, try 4-bit quantization
model = ViModel(run_id="your-run-id", load_in_4bit=True)

# Enable low CPU memory usage
model = ViModel(
    run_id="your-run-id",
    load_in_4bit=True,
    low_cpu_mem_usage=True
)

Model Download Fails

Check your connection and credentials:

try:
    model = ViModel(
        secret_key="your-secret-key",
        organization_id="your-organization-id",
        run_id="your-run-id"
    )
except ValueError as e:
    if "Failed to download model" in str(e):
        print("Check your credentials and run ID")
        print("Ensure the model has finished training")

Slow Loading

Improve loading speed:

# Use cached models (default)
model = ViModel(run_id="your-run-id")  # Fast if cached

# Enable low CPU memory usage
model = ViModel(
    run_id="your-run-id",
    low_cpu_mem_usage=True
)

More troubleshooting →


See also