Improve Performance

Datature Vi inference performance depends on three factors: how the model loads, how you run predictions, and how you manage GPU memory. This page covers the key levers for each.

Before You Start

Learn how to run inference →

Memory management

Quantization

Quantization reduces model weight precision to cut memory usage. The tradeoff is a small accuracy reduction, which is acceptable for most production use cases.

8-bit quantization (~50% memory reduction)

from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    load_in_8bit=True,
    device_map="auto"
)

Cuts memory by about half with minimal accuracy impact. Start here when your GPU has limited VRAM.

4-bit quantization (~75% memory reduction)

model = ViModel(
    run_id="your-run-id",
    load_in_4bit=True,
    device_map="auto"
)

Cuts memory by about three-quarters. There is a slight accuracy tradeoff, but this lets you run larger models on smaller GPUs.

Mixed precision

FP16 runs about 2x faster than FP32 and uses 50% less memory, with negligible quality change on modern GPUs.

model = ViModel(
    run_id="your-run-id",
    dtype="float16",
    device_map="auto"
)

Low CPU memory usage

Reduces peak CPU memory during model loading. Worth enabling when loading large models, especially with quantization.

model = ViModel(
    run_id="your-run-id",
    low_cpu_mem_usage=True,
    device_map="auto"
)

Clear GPU cache

For long-running batch jobs, clearing the GPU cache periodically prevents fragmentation from accumulating.

import torch
import gc

for i, image in enumerate(images):
    result, error = model(source=image)

    if i % 100 == 0:
        torch.cuda.empty_cache()
        gc.collect()

GPU utilization

Check GPU availability

import torch

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA: {torch.version.cuda}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("No GPU available, using CPU")

Monitor GPU memory

def print_gpu_utilization():
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        print(f"GPU Memory: Allocated: {allocated:.2f} GB, Reserved: {reserved:.2f} GB")

print_gpu_utilization()  # Before loading
model = ViModel(run_id="your-run-id")
print_gpu_utilization()  # After loading

Flash Attention 2

Flash Attention 2 speeds up inference by 2-3x on compatible GPUs and reduces memory for long sequences. It requires the flash-attn package and an Ampere-or-newer GPU (RTX 30xx, A100, H100, etc.).

model = ViModel(
    run_id="your-run-id",
    attn_implementation="flash_attention_2",
    dtype="float16"
)
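
To guard the flag at runtime, you can check both requirements before deciding how to load. This is a minimal sketch using only PyTorch and the standard library; when either check fails, it simply omits attn_implementation:

import importlib.util

import torch

from vi.inference import ViModel

# flash-attn must be installed and the GPU must be Ampere or newer (compute capability 8.0+)
has_flash_attn = importlib.util.find_spec("flash_attn") is not None
is_ampere_or_newer = (
    torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8
)

if has_flash_attn and is_ampere_or_newer:
    model = ViModel(
        run_id="your-run-id",
        attn_implementation="flash_attention_2",
        dtype="float16"
    )
else:
    model = ViModel(run_id="your-run-id", dtype="float16")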

Multi-GPU distribution

device_map="auto" distributes the model across all available GPUs automatically, which is the right default for multi-GPU servers.

model = ViModel(
    run_id="your-run-id",
    device_map="auto"
)
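
To confirm that the weights were actually spread across devices, you can print per-GPU allocation with plain PyTorch after loading; this is a small sketch, not a Vi-specific feature:

import torch

# Report how much memory each visible GPU holds after the model is loaded
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1e9
    print(f"GPU {i} ({torch.cuda.get_device_name(i)}): {allocated:.2f} GB allocated")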

Batch processing strategies

Use native batch inference

The SDK's native batch support is faster than looping over single-image calls, and it handles progress tracking and per-image error isolation.

# Good: native batch inference
results = model(
    source=["img1.jpg", "img2.jpg", "img3.jpg"],
    user_prompt="Describe this",
    show_progress=True
)

# Avoid: manual loop (slower, no progress tracking)
results = []
for img in ["img1.jpg", "img2.jpg", "img3.jpg"]:
    result, error = model(source=img)
    results.append((result, error))
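
Because each input comes back as its own (result, error) pair, one failed image does not abort the batch. A minimal sketch for pairing outcomes with their sources; the assumption here is that results are returned in input order:

sources = ["img1.jpg", "img2.jpg", "img3.jpg"]
results = model(source=sources, user_prompt="Describe this")

# Match each outcome to its source image (assumes input-order results)
for src, (result, error) in zip(sources, results):
    if error is None:
        print(f"{src}: {result.result}")
    else:
        print(f"{src}: failed with {error}")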

Process folders directly

Folder paths avoid the overhead of building file lists manually.

# Good: pass folder path directly
results = model(source="./images/", recursive=True)

# Avoid: manual file listing
from pathlib import Path
images = list(Path("./images").glob("*.jpg"))
results = model(source=images)

Chunk large datasets

Process large datasets in chunks to keep GPU memory under control. Clear the cache between chunks.

import torch
from pathlib import Path

def process_in_chunks(model, image_dir, chunk_size=100):
    all_images = list(Path(image_dir).glob("*.jpg"))
    print(f"Processing {len(all_images)} images in chunks of {chunk_size}")

    all_results = []
    for i in range(0, len(all_images), chunk_size):
        chunk = all_images[i:i + chunk_size]
        results = model(
            source=chunk,
            user_prompt="Describe this",
            show_progress=True
        )
        all_results.extend(results)
        torch.cuda.empty_cache()

    return all_results

# Recommended chunk sizes by VRAM:
# 8 GB VRAM  → chunk_size=50
# 16 GB VRAM → chunk_size=100
# 24 GB+ VRAM → chunk_size=200
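
If you would rather derive the chunk size from the GPU than hard-code it, a small helper can map total VRAM to the values above. This is a sketch; pick_chunk_size is a hypothetical helper using the thresholds from this table:

import torch

def pick_chunk_size(default=50):
    # Map total VRAM to the recommended chunk sizes above
    if not torch.cuda.is_available():
        return default
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if vram_gb >= 24:
        return 200
    if vram_gb >= 16:
        return 100
    return 50

results = process_in_chunks(model, "./images", chunk_size=pick_chunk_size())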

Model loading strategies

Reuse model instances

Creating a ViModel loads weights from disk into GPU memory. Do this once at startup, not inside a loop.

# Good: create once, reuse
model = ViModel(run_id="your-run-id")
for image in images:
    result, error = model(source=image)

# Avoid: recreating on each iteration
for image in images:
    model = ViModel(run_id="your-run-id")  # loads weights every time
    result, error = model(source=image)

Inspect before loading

Check model size before committing to a loading strategy. This reads only metadata files and requires no GPU memory.

info = ViModel.inspect(run_id="your-run-id")
print(f"Model size: {info.size_gb:.2f} GB")

if info.size_gb > 10:
    model = ViModel(run_id="your-run-id", load_in_8bit=True)
else:
    model = ViModel(run_id="your-run-id")

Use the model cache

After the first download, the SDK caches the model locally at ~/.datature/vi/models/. Subsequent loads are fast. Caching is on by default; you don't need to configure anything.
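
If you want to see what is cached or how much disk it uses, the directory can be inspected with the standard library. A sketch using the cache path mentioned above:

from pathlib import Path

cache_dir = Path.home() / ".datature" / "vi" / "models"

if cache_dir.exists():
    size_gb = sum(f.stat().st_size for f in cache_dir.rglob("*") if f.is_file()) / 1e9
    print(f"Cached models in {cache_dir}: {size_gb:.2f} GB")
else:
    print("No models cached yet")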

Code improvement patterns

Efficient error handling

# Good: check error status
result, error = model(source="image.jpg")
if error is None:
    print(result.result)
else:
    print(f"Failed: {error}")

# Avoid: using try/except for normal control flow
try:
    result = some_function()
except Exception:
    pass

Minimize attribute checks

Check optional attributes once and reuse the result, rather than repeating hasattr calls for every item you process.

result, error = model(source="image.jpg")
if error is None:
    # Check once, then reuse the cached result inside the loop
    has_groundings = hasattr(result, "result") and hasattr(result.result, "groundings")
    if has_groundings:
        for grounding in result.result.groundings:
            process(grounding)

Use list comprehensions

# Good: list comprehension
successful = [r for r, e in results if e is None]

# Avoid: manual append loop
successful = []
for r, e in results:
    if e is None:
        successful.append(r)

Performance benchmarking

Measure inference time

import time

def benchmark_inference(model, image_path, iterations=10):
    times = []
    model(source=image_path)  # warm-up

    for _ in range(iterations):
        start = time.time()
        model(source=image_path)
        times.append(time.time() - start)

    avg_time = sum(times) / len(times)
    print(f"Average: {avg_time:.3f}s")
    print(f"Throughput: {1 / avg_time:.1f} images/sec")

benchmark_inference(model, "test.jpg")

Compare configurations

import time

import torch

configs = [
    {"name": "Float32", "dtype": "float32"},
    {"name": "Float16", "dtype": "float16"},
    {"name": "8-bit", "load_in_8bit": True},
    {"name": "4-bit", "load_in_4bit": True},
]

for config in configs:
    name = config.pop("name")
    m = ViModel(run_id="your-run-id", **config)

    start = time.time()
    m(source="test.jpg")
    elapsed = time.time() - start

    print(f"{name}: {elapsed:.3f}s")

    # Free the previous model before loading the next configuration
    del m
    torch.cuda.empty_cache()

Production configuration

Recommended production configuration

import os

model = ViModel(
    secret_key=os.getenv("VI_SECRET_KEY"),
    organization_id=os.getenv("VI_ORG_ID"),
    run_id="your-run-id",
    device_map="auto",
    dtype="float16",
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True,
    load_in_8bit=True  # if memory is constrained
)

Disable progress display in automated pipelines

results = model(
    source=images,
    user_prompt="Describe this",
    show_progress=False
)

Load model once at application startup

from fastapi import FastAPI

app = FastAPI()
model = None

@app.on_event("startup")
async def load_model():
    global model
    model = ViModel(run_id="your-run-id")

@app.post("/predict")
async def predict(image_path: str):
    result, error = model(source=image_path)
    if error is None:
        return {"result": result.result}
    return {"error": str(error)}

Hardware recommendations

Tier         CPU         RAM     GPU VRAM                    Storage
Minimum      4+ cores    16 GB   8 GB (with quantization)    50 GB
Production   8+ cores    32 GB   16 GB                       100 GB SSD
Optimal      16+ cores   64 GB   24 GB+ or multi-GPU         NVMe SSD

For the optimal tier, NVMe storage reduces model cache load time on first inference.

Related resources

Load Models

Quantization options, device mapping, and loading from Datature Vi or HuggingFace.

Run Inference

Batch processing, chunked workflows, and error handling patterns.

Troubleshoot Issues

Fix OOM errors, slow inference, and other common performance problems.