Load Models
ViModel is the entry point for loading and running vision-language models (VLMs) in Datature Vi. It downloads models from Datature Vi or HuggingFace, caches them locally, and handles device placement automatically.
Before you start, you'll need:
- Vi SDK installed with inference dependencies
- A trained model run ID (for Datature Vi models) or a HuggingFace model name
- Your secret key for Datature Vi models
Basic model loading
From Datature Vi
from vi.inference import ViModel
model = ViModel(
    run_id="your-run-id",
    secret_key="your-secret-key",
    organization_id="your-organization-id"
)

Store credentials as environment variables instead of hardcoding them:
export DATATURE_VI_SECRET_KEY="your-secret-key"
export DATATURE_VI_ORGANIZATION_ID="your-organization-id"

model = ViModel(run_id="your-run-id")
# Reads credentials from DATATURE_VI_SECRET_KEY and DATATURE_VI_ORGANIZATION_ID

From HuggingFace
model = ViModel(
    pretrained_model_name_or_path="Qwen/Qwen2.5-VL-7B-Instruct"
)

From a local path
model = ViModel(
    pretrained_model_name_or_path="./path/to/local/model"
)

Parameters
Device management
Automatic device selection
By default, ViModel detects available GPUs and places the model there. Pass device_map="auto" explicitly or omit it; both behave the same way.
model = ViModel(
    run_id="your-run-id",
    device_map="auto"  # default
)

Manual device selection
import torch
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():  # Apple Silicon
    device = "mps"
else:
    device = "cpu"

model = ViModel(run_id="your-run-id", device_map=device)

Multi-GPU distribution
device_map="auto" distributes a large model across all available GPUs automatically.
model = ViModel(
    run_id="your-run-id",
    device_map="auto",
    dtype="float16"
)

Memory optimization
8-bit quantization
Reduces memory by ~50%. The best starting point for GPUs with limited VRAM.
model = ViModel(
    run_id="your-run-id",
    load_in_8bit=True,
    device_map="auto"
)

4-bit quantization
Reduces memory by ~75%. A slight accuracy tradeoff, but lets you run larger models on smaller GPUs.
model = ViModel(
    run_id="your-run-id",
    load_in_4bit=True,
    device_map="auto"
)

- You cannot set both load_in_8bit and load_in_4bit at the same time.
- Quantization requires the bitsandbytes library, which is included in vi-sdk[inference].
- A few model layers may not support quantization and will stay in their original precision.
Mixed precision
float16 is faster and uses less memory than float32 on modern GPUs with negligible quality impact.
model = ViModel(
    run_id="your-run-id",
    dtype="float16",
    device_map="auto"
)

Supported dtypes: "float32" (default), "float16" (recommended on GPUs), "bfloat16" (better numeric range than float16).
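For example, a minimal sketch that picks a dtype from the hardware detected at runtime (the run_id value is a placeholder; the torch checks are standard PyTorch, not part of the Vi SDK):

import torch
from vi.inference import ViModel

# Prefer bfloat16 on GPUs that support it, fall back to float16 on other GPUs,
# and keep the float32 default on CPU-only machines.
if torch.cuda.is_available():
    dtype = "bfloat16" if torch.cuda.is_bf16_supported() else "float16"
else:
    dtype = "float32"

model = ViModel(run_id="your-run-id", dtype=dtype, device_map="auto")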
Low CPU memory usage
Reduces peak CPU memory during the loading phase. Worth enabling when loading large models.
model = ViModel(
    run_id="your-run-id",
    low_cpu_mem_usage=True,
    device_map="auto"
)

Advanced loading options
Flash Attention 2
Flash Attention 2 speeds up inference on Ampere-or-newer GPUs.
model = ViModel(
    run_id="your-run-id",
    attn_implementation="flash_attention_2",
    dtype="float16"
)

Flash Attention 2 requires the flash-attn package. It only works on compatible GPUs (Ampere or newer) and falls back to standard attention if unavailable.
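If you'd rather check up front instead of relying on the fallback, one approach is to probe for the package before constructing the model. This is only a sketch; importlib is from the standard library, and the fallback here is simply to omit the argument:

import importlib.util
from vi.inference import ViModel

# Request Flash Attention 2 only if the flash-attn package is importable.
kwargs = {}
if importlib.util.find_spec("flash_attn") is not None:
    kwargs["attn_implementation"] = "flash_attention_2"

model = ViModel(run_id="your-run-id", dtype="float16", **kwargs)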
Full configuration example
model = ViModel(
    run_id="your-run-id",
    device_map="auto",
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",
    dtype="float16"
)

Model caching
Downloaded models are cached locally at ~/.datature/vi/models/ by default. The directory layout looks like this:
~/.datature/vi/models/
└── <run-id>/
    ├── model_full/        # Full model weights
    ├── adapter/           # Adapter weights (if available)
    └── run_config.json    # Training configuration

Re-initializing with the same run_id uses the cached copy and skips the download.
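To see what is already cached and how much disk it uses, you can walk the cache directory directly. This sketch only assumes the default layout shown above:

from pathlib import Path

cache_root = Path.home() / ".datature" / "vi" / "models"

# Report each cached run and its total size on disk.
if cache_root.exists():
    for run_dir in sorted(cache_root.iterdir()):
        if run_dir.is_dir():
            size_gb = sum(f.stat().st_size for f in run_dir.rglob("*") if f.is_file()) / 1e9
            print(f"{run_dir.name}: {size_gb:.2f} GB")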
Custom save path
model = ViModel(
    run_id="your-run-id",
    save_path="./my_models"
)

Force re-download
model = ViModel(
    run_id="your-run-id",
    overwrite=True
)

Use overwrite=True when the model was updated after your initial download or if you suspect cache corruption.
Inspecting models
Check model metadata before loading the weights. This is fast: it only reads configuration files and requires no GPU memory.
from vi.inference import ViModel
info = ViModel.inspect(
    secret_key="your-key",
    organization_id="your-org",
    run_id="your-run-id"
)
print(f"Model: {info.model_name}")
print(f"Size: {info.size_gb:.2f} GB")
print(f"Architecture: {info.architecture}")
print(f"Task: {info.task_type}")
# Decide loading strategy based on size
if info.size_gb > 10:
    model = ViModel(run_id="your-run-id", load_in_8bit=True)
else:
    model = ViModel(run_id="your-run-id")

Inspection works the same way for a local model directory:

info = ViModel.inspect(
    pretrained_model_name_or_path="./path/to/model"
)
print(info)

Supported model architectures
DeepSeek OCR and LLaVA-NeXT are coming in future releases. Check the changelog for updates.
Loading best practices
Reuse model instances. Loading a model is expensive. Create it once and call it many times. Don't recreate it inside a loop.
# Good: create once, reuse
model = ViModel(run_id="your-run-id")
for image in images:
    result, error = model(source=image)

# Bad: recreating for each image wastes time and memory
for image in images:
    model = ViModel(run_id="your-run-id")
    result, error = model(source=image)

Choose quantization based on your hardware. 8-bit is the right default for most GPU setups. Switch to 4-bit if you're hitting memory limits. Use no quantization only when memory allows, as it gives the best accuracy.
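One way to apply that rule is to compare the model size reported by ViModel.inspect (shown earlier) against the GPU memory you actually have. This is only a sketch: the 1.2x and 0.6x headroom thresholds are illustrative assumptions, not SDK recommendations, and the credential values are placeholders.

import torch
from vi.inference import ViModel

info = ViModel.inspect(
    secret_key="your-secret-key",
    organization_id="your-organization-id",
    run_id="your-run-id"
)
gpu_gb = torch.cuda.get_device_properties(0).total_memory / 1e9 if torch.cuda.is_available() else 0.0

# Pick the lightest quantization that should fit, leaving headroom for activations.
if gpu_gb >= info.size_gb * 1.2:
    model = ViModel(run_id="your-run-id")                      # full precision, best accuracy
elif gpu_gb >= info.size_gb * 0.6:
    model = ViModel(run_id="your-run-id", load_in_8bit=True)   # ~50% of full-precision memory
else:
    model = ViModel(run_id="your-run-id", load_in_4bit=True)   # ~25% of full-precision memory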
Handle loading errors. Wrap the constructor in a try/except when credentials or network access might fail.
try:
    model = ViModel(
        run_id="your-run-id",
        secret_key="your-secret-key",
        organization_id="your-organization-id"
    )
except ValueError as e:
    print(f"Failed to load model: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Loading scenarios
# Scenario: limited GPU memory, load with 8-bit quantization
model = ViModel(
    run_id="your-run-id",
    load_in_8bit=True,
    low_cpu_mem_usage=True
)

# Scenario: credentials from environment variables, GPU-optimized loading
import os
model = ViModel(
    secret_key=os.getenv("VI_SECRET_KEY"),
    organization_id=os.getenv("VI_ORG_ID"),
    run_id="your-run-id",
    device_map="auto",
    dtype="float16",
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True
)

# Scenario: very limited GPU memory, load with 4-bit quantization
model = ViModel(
    run_id="your-run-id",
    load_in_4bit=True,
    low_cpu_mem_usage=True,
    device_map="auto"
)