Load Models

ViModel is the entry point for loading and running vision-language models (VLMs) in Datature Vi. It downloads models from Datature Vi or HuggingFace, caches them locally, and handles device placement automatically.

Before You Start

Install the SDK with the inference extras (vi-sdk[inference]). To load models hosted on Datature Vi, you also need your secret key and organization ID.

Basic model loading

From Datature Vi

from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    secret_key="your-secret-key",
    organization_id="your-organization-id"
)

Store credentials as environment variables instead of hardcoding them:

export DATATURE_VI_SECRET_KEY="your-secret-key"
export DATATURE_VI_ORGANIZATION_ID="your-organization-id"

# Reads credentials from DATATURE_VI_SECRET_KEY and DATATURE_VI_ORGANIZATION_ID
model = ViModel(run_id="your-run-id")

From HuggingFace

model = ViModel(
    pretrained_model_name_or_path="Qwen/Qwen2.5-VL-7B-Instruct"
)

From a local path

model = ViModel(
    pretrained_model_name_or_path="./path/to/local/model"
)

Parameters

| Name | Type | Description | Required | Default |
|------|------|-------------|----------|---------|
| run_id | string | Run ID of a trained Datature Vi model. Required when loading from Datature Vi. | Optional | — |
| secret_key | string | Datature Vi API secret key. Falls back to the DATATURE_VI_SECRET_KEY environment variable. | Optional | — |
| organization_id | string | Datature Vi organization ID. Falls back to the DATATURE_VI_ORGANIZATION_ID environment variable. | Optional | — |
| pretrained_model_name_or_path | string | HuggingFace model name (e.g. 'Qwen/Qwen2.5-VL-7B-Instruct') or local directory path. | Optional | — |
| device_map | string | Device placement strategy. Use 'auto' for automatic GPU detection, 'cuda' for a specific GPU, 'cpu' to force CPU. | Optional | "auto" |
| load_in_8bit | boolean | Load the model in 8-bit quantization. Cuts memory ~50% with minimal accuracy loss. | Optional | false |
| load_in_4bit | boolean | Load the model in 4-bit quantization. Cuts memory ~75% with a slight accuracy tradeoff. | Optional | false |
| dtype | string | Model weight precision: 'float32', 'float16', or 'bfloat16'. Use 'float16' on modern GPUs. | Optional | "float32" |
| attn_implementation | string | Attention implementation. Set to 'flash_attention_2' for faster inference on Ampere+ GPUs (requires the flash-attn package). | Optional | — |
| low_cpu_mem_usage | boolean | Reduce CPU memory during model loading. Recommended when loading large models. | Optional | false |
| save_path | string | Custom directory for caching downloaded models. | Optional | ~/.datature/vi/models/ |
| overwrite | boolean | Force a fresh download even if the model is already cached. | Optional | false |

Device management

Automatic device selection

By default, ViModel detects available GPUs and places the model there. Pass device_map="auto" explicitly or omit it; both behave the same way.

model = ViModel(
    run_id="your-run-id",
    device_map="auto"  # default
)

Manual device selection

import torch

if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():  # Apple Silicon
    device = "mps"
else:
    device = "cpu"

model = ViModel(run_id="your-run-id", device_map=device)

Multi-GPU distribution

device_map="auto" distributes a large model across all available GPUs automatically.

model = ViModel(
    run_id="your-run-id",
    device_map="auto",
    dtype="float16"
)

Memory optimization

8-bit quantization

Reduces memory by ~50%. The best starting point for GPUs with limited VRAM.

model = ViModel(
    run_id="your-run-id",
    load_in_8bit=True,
    device_map="auto"
)

4-bit quantization

Reduces memory by ~75% with a slight accuracy tradeoff, letting you run larger models on smaller GPUs.

model = ViModel(
    run_id="your-run-id",
    load_in_4bit=True,
    device_map="auto"
)

Quantization Notes
  • You cannot set both load_in_8bit and load_in_4bit at the same time.
  • Quantization requires the bitsandbytes library, which is included in vi-sdk[inference].
  • A few model layers may not support quantization and will stay in their original precision.

Mixed precision

float16 is faster and uses less memory than float32 on modern GPUs with negligible quality impact.

model = ViModel(
    run_id="your-run-id",
    dtype="float16",
    device_map="auto"
)

Supported dtypes: "float32" (default), "float16" (recommended on GPUs), "bfloat16" (better numeric range than float16)
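
If you want to pick the precision at runtime rather than hardcode it, you can query the GPU first. A minimal sketch using PyTorch's torch.cuda.is_bf16_supported(); the selection logic is an illustration, not part of the SDK:

import torch

# Prefer bfloat16 where the hardware supports it, float16 on older GPUs,
# and fall back to full precision on CPU.
if not torch.cuda.is_available():
    dtype = "float32"
elif torch.cuda.is_bf16_supported():
    dtype = "bfloat16"
else:
    dtype = "float16"

model = ViModel(run_id="your-run-id", dtype=dtype, device_map="auto")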

Low CPU memory usage

Reduces peak CPU memory during the loading phase. Worth enabling when loading large models.

model = ViModel(
    run_id="your-run-id",
    low_cpu_mem_usage=True,
    device_map="auto"
)

Advanced loading options

Flash Attention 2

Flash Attention 2 speeds up inference on Ampere-or-newer GPUs.

model = ViModel(
    run_id="your-run-id",
    attn_implementation="flash_attention_2",
    dtype="float16"
)

Flash Attention 2 requires the flash-attn package. It only works on compatible GPUs (Ampere or newer) and falls back to standard attention if unavailable.
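
Since ViModel falls back to standard attention on its own, guarding the option is not required. Still, if you prefer to opt in only when the package is actually installed, here is a minimal sketch using importlib (the kwargs pattern is illustrative, not an SDK requirement):

import importlib.util

# Request Flash Attention 2 only if the flash-attn package is importable.
kwargs = {}
if importlib.util.find_spec("flash_attn") is not None:
    kwargs["attn_implementation"] = "flash_attention_2"

model = ViModel(run_id="your-run-id", dtype="float16", **kwargs)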

Full configuration example

model = ViModel(
    run_id="your-run-id",
    device_map="auto",
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",
    dtype="float16"
)

Model caching

Downloaded models are cached locally at ~/.datature/vi/models/ by default. The directory layout looks like this:

~/.datature/vi/models/
└── <run-id>/
    ├── model_full/        # Full model weights
    ├── adapter/           # Adapter weights (if available)
    └── run_config.json    # Training configuration

Re-initializing with the same run_id uses the cached copy and skips the download.
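
To see what is already cached, you can walk the default cache directory yourself. A small sketch assuming the layout above (adjust cache_root if you pass save_path):

from pathlib import Path

cache_root = Path.home() / ".datature" / "vi" / "models"

if cache_root.exists():
    for run_dir in sorted(cache_root.iterdir()):
        if run_dir.is_dir():
            # Sum the sizes of all files under this run's cache entry.
            size_gb = sum(f.stat().st_size for f in run_dir.rglob("*") if f.is_file()) / 1e9
            print(f"{run_dir.name}: {size_gb:.2f} GB")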

Custom save path

model = ViModel(
    run_id="your-run-id",
    save_path="./my_models"
)

Force re-download

model = ViModel(
    run_id="your-run-id",
    overwrite=True
)

Use overwrite=True when the model was updated after your initial download or if you suspect cache corruption.

Inspecting models

Check model metadata before loading the weights. This is fast: it only reads configuration files and requires no GPU memory.

from vi.inference import ViModel

info = ViModel.inspect(
    secret_key="your-key",
    organization_id="your-org",
    run_id="your-run-id"
)

print(f"Model: {info.model_name}")
print(f"Size: {info.size_gb:.2f} GB")
print(f"Architecture: {info.architecture}")
print(f"Task: {info.task_type}")

# Decide loading strategy based on size
if info.size_gb > 10:
    model = ViModel(run_id="your-run-id", load_in_8bit=True)
else:
    model = ViModel(run_id="your-run-id")

You can also inspect a HuggingFace or local model by passing pretrained_model_name_or_path:

info = ViModel.inspect(
    pretrained_model_name_or_path="./path/to/model"
)
print(info)

Supported model architectures

| Model | Architecture class | Task support |
|-------|--------------------|--------------|
| Qwen2.5-VL | Qwen2_5_VLForConditionalGeneration | VQA, Phrase Grounding, multi-turn |
| InternVL 3.5 | InternVLForConditionalGeneration | VQA, Phrase Grounding, long context |
| Cosmos Reason1 | Qwen2_5_VLForConditionalGeneration | VQA, Phrase Grounding, advanced reasoning |
| NVILA | NVILA | VQA, Phrase Grounding, NVIDIA GPU optimized |

DeepSeek OCR and LLaVA-NeXT are coming in future releases. Check the changelog for updates.

Loading best practices

Reuse model instances. Loading a model is expensive. Create it once and call it many times. Don't recreate it inside a loop.

# Good: create once, reuse
model = ViModel(run_id="your-run-id")
for image in images:
    result, error = model(source=image)

# Bad: recreating for each image wastes time and memory
for image in images:
    model = ViModel(run_id="your-run-id")
    result, error = model(source=image)

Choose quantization based on your hardware. 8-bit is the right default for most GPU setups. Switch to 4-bit if you're hitting memory limits. Skip quantization only when memory comfortably allows it, since full precision gives the best accuracy.
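
One way to automate that choice is to branch on total VRAM. A hedged sketch; the thresholds below are rough illustrative guesses, not official recommendations:

import torch

def quantization_kwargs() -> dict:
    """Pick quantization flags from total VRAM (thresholds are guesses)."""
    if not torch.cuda.is_available():
        return {}
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if vram_gb >= 40:
        return {}                      # plenty of memory: full precision
    if vram_gb >= 16:
        return {"load_in_8bit": True}  # ~50% memory reduction
    return {"load_in_4bit": True}      # ~75% memory reduction

model = ViModel(run_id="your-run-id", device_map="auto", **quantization_kwargs())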

Handle loading errors. Wrap the constructor in a try/except when credentials or network access might fail.

try:
    model = ViModel(
        run_id="your-run-id",
        secret_key="your-secret-key",
        organization_id="your-organization-id"
    )
except ValueError as e:
    print(f"Failed to load model: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Loading scenarios

Memory-constrained GPU: 8-bit quantization with reduced CPU memory during loading.

model = ViModel(
    run_id="your-run-id",
    load_in_8bit=True,
    low_cpu_mem_usage=True
)

Production deployment: credentials from environment variables, half precision, and Flash Attention 2.

import os

model = ViModel(
    secret_key=os.getenv("VI_SECRET_KEY"),
    organization_id=os.getenv("VI_ORG_ID"),
    run_id="your-run-id",
    device_map="auto",
    dtype="float16",
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True
)

Severely memory-constrained hardware: 4-bit quantization.

model = ViModel(
    run_id="your-run-id",
    load_in_4bit=True,
    low_cpu_mem_usage=True,
    device_map="auto"
)

Related resources

Run Inference

Single-image and batch inference, streaming, error handling, and common workflows.

Improve Performance

GPU utilization, batching strategies, and hardware recommendations.

Troubleshoot Issues

Fix OOM errors, slow loading, failed downloads, and other common problems.