Improve Performance

Datature Vi inference performance depends on three factors: how the model loads, how you run predictions, and how you manage GPU memory. This page covers the key levers for each.

Before You Start

Learn how to run inference →

Memory management

Quantization

Quantization reduces model weight precision to cut memory usage. The tradeoff is a small accuracy reduction, which is acceptable for most production use cases.

8-bit quantization (~50% memory reduction)

from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    load_in_8bit=True,
    device_map="auto"
)

Cuts memory by about half with minimal accuracy impact. Start here when your GPU has limited VRAM.

4-bit quantization (~75% memory reduction)

model = ViModel(
    run_id="your-run-id",
    load_in_4bit=True,
    device_map="auto"
)

Cuts memory by about three-quarters. There is a slight accuracy tradeoff, but this lets you run larger models on smaller GPUs.

Mixed precision

FP16 runs about 2x faster than FP32 and uses 50% less memory, with negligible quality change on modern GPUs.

model = ViModel(
    run_id="your-run-id",
    dtype="float16",
    device_map="auto"
)

Low CPU memory usage

Reduces peak CPU memory during model loading. Worth enabling when loading large models, especially with quantization.

model = ViModel(
    run_id="your-run-id",
    low_cpu_mem_usage=True,
    device_map="auto"
)

Clear GPU cache

For long-running batch jobs, clearing the GPU cache periodically prevents fragmentation from accumulating.

import torch
import gc

for i, image in enumerate(images):
    result, error = model(source=image)

    if i % 100 == 0:
        torch.cuda.empty_cache()
        gc.collect()

GPU utilization

Check GPU availability

import torch

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA: {torch.version.cuda}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("No GPU available, using CPU")

Monitor GPU memory

def print_gpu_utilization():
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        print(f"GPU Memory: Allocated: {allocated:.2f} GB, Reserved: {reserved:.2f} GB")

print_gpu_utilization()  # Before loading
model = ViModel(run_id="your-run-id")
print_gpu_utilization()  # After loading

Flash Attention 2

Flash Attention 2 speeds up inference by 2-3x on compatible GPUs and reduces memory for long sequences. It requires the flash-attn package and an Ampere-or-newer GPU (RTX 30xx, A100, H100, etc.).

model = ViModel(
    run_id="your-run-id",
    attn_implementation="flash_attention_2",
    dtype="float16"
)
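
To guard the flag at runtime, you can check both requirements before deciding how to load. This is a minimal sketch using only PyTorch and the standard library; when either check fails, it simply omits attn_implementation:

import importlib.util

import torch

from vi.inference import ViModel

# flash-attn must be installed and the GPU must be Ampere or newer (compute capability 8.0+)
has_flash_attn = importlib.util.find_spec("flash_attn") is not None
is_ampere_or_newer = (
    torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8
)

if has_flash_attn and is_ampere_or_newer:
    model = ViModel(
        run_id="your-run-id",
        attn_implementation="flash_attention_2",
        dtype="float16"
    )
else:
    model = ViModel(run_id="your-run-id", dtype="float16")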

Multi-GPU distribution

device_map="auto" distributes the model across all available GPUs automatically, which is the right default for multi-GPU servers.

model = ViModel(
    run_id="your-run-id",
    device_map="auto"
)
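
To confirm that the weights were actually spread across devices, you can print per-GPU allocation with plain PyTorch after loading; this is a small sketch, not a Vi-specific feature:

import torch

# Report how much memory each visible GPU holds after the model is loaded
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1e9
    print(f"GPU {i} ({torch.cuda.get_device_name(i)}): {allocated:.2f} GB allocated")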

Batch processing strategies

Use native batch inference

The SDK's native batch support is faster than looping over single-image calls, and it handles progress tracking and per-image error isolation.

# Good: native batch inference
results = model(
    source=["img1.jpg", "img2.jpg", "img3.jpg"],
    user_prompt="Describe this",
    show_progress=True
)

# Avoid: manual loop (slower, no progress tracking)
results = []
for img in ["img1.jpg", "img2.jpg", "img3.jpg"]:
    result, error = model(source=img)
    results.append((result, error))
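
Because each input comes back as its own (result, error) pair, one failed image does not abort the batch. A minimal sketch for pairing outcomes with their sources; the assumption here is that results are returned in input order:

sources = ["img1.jpg", "img2.jpg", "img3.jpg"]
results = model(source=sources, user_prompt="Describe this")

# Match each outcome to its source image (assumes input-order results)
for src, (result, error) in zip(sources, results):
    if error is None:
        print(f"{src}: {result.result}")
    else:
        print(f"{src}: failed with {error}")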

Process folders directly

Folder paths avoid the overhead of building file lists manually.

# Good: pass folder path directly
results = model(source="./images/", recursive=True)

# Avoid: manual file listing
from pathlib import Path
images = list(Path("./images").glob("*.jpg"))
results = model(source=images)

Chunk large datasets

Process large datasets in chunks to keep GPU memory under control. Clear the cache between chunks.

import torch
from pathlib import Path

def process_in_chunks(model, image_dir, chunk_size=100):
    all_images = list(Path(image_dir).glob("*.jpg"))
    print(f"Processing {len(all_images)} images in chunks of {chunk_size}")

    all_results = []
    for i in range(0, len(all_images), chunk_size):
        chunk = all_images[i:i + chunk_size]
        results = model(
            source=chunk,
            user_prompt="Describe this",
            show_progress=True
        )
        all_results.extend(results)
        torch.cuda.empty_cache()

    return all_results

# Recommended chunk sizes by VRAM:
# 8 GB VRAM  → chunk_size=50
# 16 GB VRAM → chunk_size=100
# 24 GB+ VRAM → chunk_size=200
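
If you would rather derive the chunk size from the GPU than hard-code it, a small helper can map total VRAM to the values above. This is a sketch; pick_chunk_size is a hypothetical helper using the thresholds from this table:

import torch

def pick_chunk_size(default=50):
    # Map total VRAM to the recommended chunk sizes above
    if not torch.cuda.is_available():
        return default
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if vram_gb >= 24:
        return 200
    if vram_gb >= 16:
        return 100
    return 50

results = process_in_chunks(model, "./images", chunk_size=pick_chunk_size())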

Model loading strategies

Reuse model instances

Creating a ViModel loads weights from disk into GPU memory. Do this once at startup, not inside a loop.

# Good: create once, reuse
model = ViModel(run_id="your-run-id")
for image in images:
    result, error = model(source=image)

# Avoid: recreating on each iteration
for image in images:
    model = ViModel(run_id="your-run-id")  # loads weights every time
    result, error = model(source=image)

Inspect before loading

Check model size before committing to a loading strategy. This reads only metadata files and requires no GPU memory.

info = ViModel.inspect(run_id="your-run-id")
print(f"Model size: {info.size_gb:.2f} GB")

if info.size_gb > 10:
    model = ViModel(run_id="your-run-id", load_in_8bit=True)
else:
    model = ViModel(run_id="your-run-id")

Use the model cache

After the first download, the SDK caches the model locally at ~/.datature/vi/models/. Subsequent loads are fast. Caching is on by default; you don't need to configure anything.
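
If you want to see what is cached or how much disk it uses, the directory can be inspected with the standard library. A sketch using the cache path mentioned above:

from pathlib import Path

cache_dir = Path.home() / ".datature" / "vi" / "models"

if cache_dir.exists():
    size_gb = sum(f.stat().st_size for f in cache_dir.rglob("*") if f.is_file()) / 1e9
    print(f"Cached models in {cache_dir}: {size_gb:.2f} GB")
else:
    print("No models cached yet")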

Code improvement patterns

Efficient error handling

# Good: check error status
result, error = model(source="image.jpg")
if error is None:
    print(result.result)
else:
    print(f"Failed: {error}")

# Avoid: using try/except for normal control flow
try:
    result = some_function()
except Exception:
    pass

Minimize attribute checks

Check optional attributes once and reuse the result, rather than repeating hasattr calls for every item you process.

result, error = model(source="image.jpg")
if error is None:
    # Check once, then reuse the cached result inside the loop
    has_groundings = hasattr(result, "result") and hasattr(result.result, "groundings")
    if has_groundings:
        for grounding in result.result.groundings:
            process(grounding)

Use list comprehensions

# Good: list comprehension
successful = [r for r, e in results if e is None]

# Avoid: manual append loop
successful = []
for r, e in results:
    if e is None:
        successful.append(r)

Performance benchmarking

Measure inference time

import time

def benchmark_inference(model, image_path, iterations=10):
    times = []
    model(source=image_path)  # warm-up

    for _ in range(iterations):
        start = time.time()
        model(source=image_path)
        times.append(time.time() - start)

    avg_time = sum(times) / len(times)
    print(f"Average: {avg_time:.3f}s")
    print(f"Throughput: {1 / avg_time:.1f} images/sec")

benchmark_inference(model, "test.jpg")

Compare configurations

import time

import torch

configs = [
    {"name": "Float32", "dtype": "float32"},
    {"name": "Float16", "dtype": "float16"},
    {"name": "8-bit", "load_in_8bit": True},
    {"name": "4-bit", "load_in_4bit": True},
]

for config in configs:
    name = config.pop("name")
    m = ViModel(run_id="your-run-id", **config)

    start = time.time()
    m(source="test.jpg")
    elapsed = time.time() - start

    print(f"{name}: {elapsed:.3f}s")

    # Free the previous model before loading the next configuration
    del m
    torch.cuda.empty_cache()

Production configuration

Recommended production configuration

import os

model = ViModel(
    secret_key=os.getenv("VI_SECRET_KEY"),
    organization_id=os.getenv("VI_ORG_ID"),
    run_id="your-run-id",
    device_map="auto",
    dtype="float16",
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True,
    load_in_8bit=True  # if memory is constrained
)

Disable progress display in automated pipelines

results = model(
    source=images,
    user_prompt="Describe this",
    show_progress=False
)

Load model once at application startup

from fastapi import FastAPI

app = FastAPI()
model = None

@app.on_event("startup")
async def load_model():
    global model
    model = ViModel(run_id="your-run-id")

@app.post("/predict")
async def predict(image_path: str):
    result, error = model(source=image_path)
    if error is None:
        return {"result": result.result}
    return {"error": str(error)}

Hardware recommendations

Tier         CPU         RAM     GPU VRAM                    Storage
Minimum      4+ cores    16 GB   8 GB (with quantization)    50 GB
Production   8+ cores    32 GB   16 GB                       100 GB SSD
Optimal      16+ cores   64 GB   24 GB+ or multi-GPU         NVMe SSD

For the optimal tier, NVMe storage reduces model cache load time on first inference.

Related resources

Load Models

Quantization options, device mapping, and loading from Datature Vi or HuggingFace.

Run Inference

Batch processing, chunked workflows, and error handling patterns.

Troubleshoot Issues

Fix OOM errors, slow inference, and other common performance problems.