Load Models
Load trained vision-language models from Datature Vi or HuggingFace for inference.
Overview
The Vi SDK provides the ViModel class as the recommended way to load models and run inference. It handles:
- Automatic downloading - Downloads models from Datature Vi
- Caching - Caches models locally to avoid re-downloading
- Device management - Automatically detects and uses available GPUs
- Memory optimization - Supports quantization and efficient loading
- Error handling - Graceful error handling and validation
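A minimal sketch of the typical workflow, assuming credentials are supplied via environment variables and using the (result, error) call pattern shown in the best-practice examples later in this page:
from vi.inference import ViModel

# Load a trained model by run ID (credentials read from the environment)
model = ViModel(run_id="your-run-id")

# Run inference on a single image; the call returns a (result, error) tuple
result, error = model(source="path/to/image.jpg")
if error is None:
    print(result)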
Basic Model Loading
From Datature Vi
Load a model trained on Datature Vi using your credentials and run ID:
from vi.inference import ViModel
# Load model with credentials and run ID
model = ViModel(
    run_id="your-run-id",
    secret_key="your-secret-key",
    organization_id="your-organization-id"
)
Using Environment Variables
For better security, set credentials as environment variables:
export DATATURE_VI_SECRET_KEY="your-secret-key"
export DATATURE_VI_ORGANIZATION_ID="your-organization-id"
Then initialize without explicit credentials:
model = ViModel(run_id="your-run-id")
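Before relying on environment variables, it can help to fail fast if they are missing. A small sketch, assuming the variable names shown above:
import os

# Verify credentials are present before constructing the model
for var in ("DATATURE_VI_SECRET_KEY", "DATATURE_VI_ORGANIZATION_ID"):
    if not os.getenv(var):
        raise RuntimeError(f"Environment variable {var} is not set")

model = ViModel(run_id="your-run-id")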
From HuggingFace
Load models directly from the HuggingFace model hub:
# Load from HuggingFace model hub
model = ViModel(
    pretrained_model_name_or_path="Qwen/Qwen2.5-VL-7B-Instruct"
)
From Local Path
Load models from a local directory:
# Load from local path
model = ViModel(
    pretrained_model_name_or_path="./path/to/local/model"
)
Device Management
Automatic Device Selection
By default, ViModel automatically detects and uses available GPUs:
# Automatically use GPU if available
model = ViModel(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id",
    device_map="auto"  # Default
)
Manual Device Selection
Specify the device explicitly:
import torch
# Check available devices
if torch.cuda.is_available():
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
    device = "cuda"
elif torch.backends.mps.is_available():  # Apple Silicon
    print("Apple Silicon GPU available")
    device = "mps"
else:
    print("Using CPU")
    device = "cpu"

# Load on specific device
model = ViModel(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id",
    device_map=device
)
Multi-GPU Distribution
Automatically distribute the model across multiple GPUs:
# Distribute across all available GPUs
model = ViModel(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id",
    device_map="auto"  # Automatically distributes
)
Memory Optimization
Quantization
Reduce memory usage with quantization:
8-bit Quantization
Reduces memory by approximately 50%:
model = ViModel(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id",
    load_in_8bit=True,
    device_map="auto"
)
Benefits:
- ~50% memory reduction
- Minimal accuracy loss
- Faster inference
4-bit Quantization
Reduces memory by approximately 75%:
model = ViModel(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id",
    load_in_4bit=True,
    device_map="auto"
)
Benefits:
- ~75% memory reduction
- Slight accuracy tradeoff
- Significantly faster inference
- Enables loading larger models
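As a rough sizing illustration (a back-of-the-envelope estimate of weight memory only, not a measurement): a 7B-parameter model holds roughly 14 GB of weights in float16, so 8-bit quantization brings that to about 7 GB and 4-bit to about 3.5 GB:
# Rough weight-memory estimate: parameter count x bytes per parameter
params = 7e9
for label, bytes_per_param in [("float16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"{label}: ~{params * bytes_per_param / 1e9:.1f} GB")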
Quantization Notes
- Cannot use both load_in_8bit and load_in_4bit simultaneously
- Quantization requires the bitsandbytes library (included with vi-sdk[inference])
- Some layers may not support quantization and will remain in original precision
Mixed Precision
Use FP16 for faster inference with minimal quality loss:
model = ViModel(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id",
    dtype="float16",  # Half precision
    device_map="auto"
)
Supported dtypes:
- "float32" - Full precision (default)
- "float16" - Half precision (recommended for GPUs)
- "bfloat16" - Brain float 16 (better range than float16)
Low CPU Memory Usage
Optimize CPU memory usage during loading:
model = ViModel(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id",
    low_cpu_mem_usage=True,
    device_map="auto"
)
Advanced Loading Options
Flash Attention 2
Enable Flash Attention 2 for faster inference:
model = ViModel(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id",
    attn_implementation="flash_attention_2"
)
Flash Attention 2 Requirements
- Requires the flash-attn package to be installed
- Only supported on compatible GPUs (Ampere or newer)
- Provides significant speedup for long sequences
- Falls back to standard attention if not available
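A hedged sketch of gating Flash Attention 2 on both the installed package and the GPU generation (Ampere corresponds to CUDA compute capability 8.0 or higher):
import importlib.util

import torch

# Enable Flash Attention 2 only when flash-attn is installed and the GPU is Ampere or newer
use_flash = (
    importlib.util.find_spec("flash_attn") is not None
    and torch.cuda.is_available()
    and torch.cuda.get_device_capability(0)[0] >= 8
)

kwargs = {"attn_implementation": "flash_attention_2"} if use_flash else {}
model = ViModel(run_id="your-run-id", **kwargs)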
Complete Configuration Example
Combine multiple optimization options:
model = ViModel(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id",
    # Device management
    device_map="auto",
    # Memory optimization
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    # Performance optimization
    attn_implementation="flash_attention_2",
    dtype="float16",
)
Model Caching
How Caching Works
Models are automatically cached locally after first download:
# First time - downloads model
model1 = ViModel(run_id="your-run-id") # Downloads
# Second time - uses cached model
model2 = ViModel(run_id="your-run-id")  # Uses cache (faster)
Default cache location:
~/.datature/vi/models/
└── <run-id>/
    ├── model_full/        # Full model weights
    ├── adapter/           # Adapter weights (if available)
    └── run_config.json    # Training configuration
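A small sketch for checking whether a run is already cached, assuming the default cache location shown above:
from pathlib import Path

def is_cached(run_id: str) -> bool:
    """Return True if the run's model files exist in the default cache."""
    return (Path.home() / ".datature" / "vi" / "models" / run_id).exists()

if is_cached("your-run-id"):
    print("Model already cached - loading will skip the download")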
Custom Save Path
Specify a custom directory for cached models:
model = ViModel(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id",
    save_path="./my_models"  # Custom cache directory
)
Force Re-download
Force a fresh download even if the model is cached:
model = ViModel(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id",
    overwrite=True  # Force re-download
)
When to use:
- Model was updated after initial download
- Cache corruption suspected
- Testing model changes
Inspecting Models
Inspect Before Loading
Check model metadata without loading the full model:
from vi.inference import ViModel
# Inspect model metadata
info = ViModel.inspect(
    secret_key="your-key",
    organization_id="your-org",
    run_id="your-run-id"
)
print(f"Model: {info.model_name}")
print(f"Size: {info.size_gb:.2f} GB")
print(f"Architecture: {info.architecture}")
print(f"Task: {info.task_type}")

# Decide whether to load based on requirements
if info.size_gb < 10 and info.task_type == "VQA":
    # Load the model
    model = ViModel(
        secret_key="your-key",
        organization_id="your-org",
        run_id="your-run-id"
    )
else:
    print("Model doesn't meet requirements")
Benefits:
- Fast - only reads configuration files
- No GPU memory required
- Helps plan resource requirements
- Useful for automated workflows
Inspect Local Model
Inspect a model from local path:
# Inspect local or HuggingFace model
info = ViModel.inspect(
    pretrained_model_name_or_path="./path/to/model"
)
print(info)
Supported Model Architectures
The Vi SDK currently supports the following model architectures:
Qwen2.5-VL
Model class: Qwen2_5_VLForConditionalGeneration
model = ViModel(run_id="your-qwen-run-id")
Features:
- Visual Question Answering (VQA)
- Phrase Grounding
- Multi-turn conversations
- High-resolution image support
InternVL 3.5
Model class: InternVLForConditionalGeneration
model = ViModel(run_id="your-internvl-run-id")
Features:
- Visual Question Answering (VQA)
- Phrase Grounding
- Multi-modal understanding
- Long context support
Cosmos Reason1
Model class: Qwen2_5_VLForConditionalGeneration
model = ViModel(run_id="your-cosmos-run-id")
Features:
- Visual Question Answering (VQA)
- Phrase Grounding
- Advanced reasoning
- Complex visual understanding
NVILA
Architecture: NVILA
model = ViModel(run_id="your-nvila-run-id")
Features:
- Visual Question Answering (VQA)
- Phrase Grounding
- Optimized for NVIDIA GPUs
- Efficient inference
Coming Soon
The following model architectures will be supported in upcoming releases:
- DeepSeek OCR — Specialized OCR model for document understanding
- LLaVA-NeXT — Advanced multimodal reasoning with improved visual comprehension
Check the changelog for updates.
Loading Best Practices
1. Reuse Model Instances
Create model once and reuse for multiple inferences:
# ✅ Good - create once, reuse
model = ViModel(run_id="your-run-id")
for image in images:
    result, error = model(source=image)

# ❌ Bad - recreate for each image (wasteful)
for image in images:
    model = ViModel(run_id="your-run-id")  # Don't do this
    result, error = model(source=image)
2. Choose Appropriate Quantization
Balance memory usage and quality:
# For GPUs with limited memory
model = ViModel(run_id="your-run-id", load_in_8bit=True)
# For very limited memory or large models
model = ViModel(run_id="your-run-id", load_in_4bit=True)
# For best quality (if memory allows)
model = ViModel(run_id="your-run-id")  # No quantization
3. Use Appropriate Precision
Choose dtype based on your hardware:
# For modern GPUs (Ampere or newer)
model = ViModel(run_id="your-run-id", dtype="float16")
# For CPUs or older GPUs
model = ViModel(run_id="your-run-id", dtype="float32")
# For TPUs or specific architectures
model = ViModel(run_id="your-run-id", dtype="bfloat16")
4. Monitor Memory Usage
Track memory consumption:
import torch
# Load model
model = ViModel(run_id="your-run-id")
# Check memory usage
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"GPU Memory - Allocated: {allocated:.2f} GB, Reserved: {reserved:.2f} GB")
5. Handle Loading Errors
Implement proper error handling:
try:
    model = ViModel(
        secret_key="your-secret-key",
        organization_id="your-organization-id",
        run_id="your-run-id"
    )
    print("Model loaded successfully")
except ValueError as e:
    print(f"Failed to load model: {e}")
    # Handle error appropriately
except Exception as e:
    print(f"Unexpected error: {e}")
Common Loading Scenarios
Development Environment
Optimize for fast iteration:
# Fast loading with caching
model = ViModel(
    run_id="your-run-id",
    load_in_8bit=True,  # Reduce memory
    low_cpu_mem_usage=True
)
Production Environment
Optimize for performance and reliability:
import os

# Production-optimized loading
model = ViModel(
    secret_key=os.getenv("VI_SECRET_KEY"),
    organization_id=os.getenv("VI_ORG_ID"),
    run_id="your-run-id",
    device_map="auto",
    dtype="float16",
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True
)
Resource-Constrained Environment
Maximize memory efficiency:
# Maximum memory efficiency
model = ViModel(
    run_id="your-run-id",
    load_in_4bit=True,  # Maximum compression
    low_cpu_mem_usage=True,
    device_map="auto"
)
Multi-GPU Server
Distribute across GPUs:
# Automatic distribution across GPUs
model = ViModel(
    run_id="your-run-id",
    device_map="auto",  # Auto-distributes
    dtype="float16",
    attn_implementation="flash_attention_2"
)
Troubleshooting
Out of Memory During Loading
If you encounter OOM errors:
# Try 8-bit quantization first
model = ViModel(run_id="your-run-id", load_in_8bit=True)
# If still OOM, try 4-bit quantization
model = ViModel(run_id="your-run-id", load_in_4bit=True)
# Enable low CPU memory usage
model = ViModel(
    run_id="your-run-id",
    load_in_4bit=True,
    low_cpu_mem_usage=True
)
Model Download Fails
Check connection and credentials:
try:
    model = ViModel(
        secret_key="your-secret-key",
        organization_id="your-organization-id",
        run_id="your-run-id"
    )
except ValueError as e:
    if "Failed to download model" in str(e):
        print("Check your credentials and run ID")
        print("Ensure the model has finished training")
Slow Loading
Improve loading speed:
# Use cached models (default)
model = ViModel(run_id="your-run-id") # Fast if cached
# Enable low CPU memory usage
model = ViModel(
    run_id="your-run-id",
    low_cpu_mem_usage=True
)
See also
- Inference Overview — Getting started with inference
- Running Inference — Execute predictions on images
- Task Types — VQA and Phrase Grounding explained
- Optimization — Performance best practices