Optimize Performance

Best practices for optimizing inference performance, memory usage, and throughput with the Vi SDK.

📋 Prerequisites: Learn how to run inference →

Overview

This guide covers the key optimization strategies for VLM inference: memory optimization, GPU utilization, batch processing, model loading, code-level patterns, benchmarking, and production deployment.


Memory Optimization

Quantization

Reduce memory usage with minimal quality loss:

8-bit Quantization (~50% reduction)

from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    load_in_8bit=True,
    device_map="auto"
)

Benefits:

  • ~50% memory reduction
  • Minimal accuracy loss
  • Faster inference
  • Recommended for most use cases

4-bit Quantization (~75% reduction)

model = ViModel(
    run_id="your-run-id",
    load_in_4bit=True,
    device_map="auto"
)

Benefits:

  • ~75% memory reduction
  • Slight accuracy tradeoff
  • Enables larger models
  • Best for limited VRAM

Mixed Precision

Use FP16 for faster inference:

model = ViModel(
    run_id="your-run-id",
    dtype="float16",
    device_map="auto"
)

Benefits:

  • 2x faster than FP32
  • 50% less memory
  • Minimal quality loss
  • Supported on modern GPUs

Low CPU Memory Usage

Optimize CPU memory during loading:

model = ViModel(
    run_id="your-run-id",
    low_cpu_mem_usage=True,
    device_map="auto"
)

Clear GPU Cache

Free unused memory periodically:

import torch
import gc

# After batch processing
torch.cuda.empty_cache()
gc.collect()

# In processing loop
for i, image in enumerate(images):
    result, error = model(source=image)

    if i % 100 == 0:
        torch.cuda.empty_cache()
        gc.collect()

GPU Utilization

Check GPU Availability

import torch

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("No GPU available")

Monitor GPU Usage

def print_gpu_utilization():
    """Print current GPU memory usage."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        print(f"GPU Memory - Allocated: {allocated:.2f} GB, Reserved: {reserved:.2f} GB")

# Check before loading
print_gpu_utilization()

# Load model
model = ViModel(run_id="your-run-id")

# Check after loading
print_gpu_utilization()

Flash Attention 2

Enable Flash Attention 2 for faster inference:

model = ViModel(
    run_id="your-run-id",
    attn_implementation="flash_attention_2",
    dtype="float16"
)

Benefits:

  • Up to 2-3x faster
  • Lower memory usage
  • Better scaling for long sequences
  • Requires a compatible GPU (Ampere or newer); see the capability check below
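
If you are unsure whether a GPU qualifies, here is a minimal check using PyTorch (assumed to be the runtime underneath the SDK; compute capability 8.0 corresponds to Ampere). Note that the flash-attn package typically needs to be installed as well:

import torch

def supports_flash_attention_2() -> bool:
    """Rough check: Flash Attention 2 needs an Ampere-or-newer GPU (compute capability >= 8.0)."""
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability(0)
    return major >= 8

if supports_flash_attention_2():
    model = ViModel(
        run_id="your-run-id",
        attn_implementation="flash_attention_2",
        dtype="float16"
    )
else:
    # Fall back to the default attention implementation
    model = ViModel(run_id="your-run-id", dtype="float16")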

Multi-GPU Distribution

Automatically distribute across GPUs:

model = ViModel(
    run_id="your-run-id",
    device_map="auto"  # Automatically distributes
)
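
Before relying on automatic sharding, it can help to confirm how many GPUs the process actually sees. A small sketch using PyTorch (assumed to be the runtime underneath the SDK):

import torch

num_gpus = torch.cuda.device_count()
print(f"Visible GPUs: {num_gpus}")

for i in range(num_gpus):
    props = torch.cuda.get_device_properties(i)
    print(f"  cuda:{i} - {props.name}, {props.total_memory / 1e9:.1f} GB")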

Batch Processing Optimization

Use Native Batch Inference

# ✅ Good - native batch inference
results = model(
    source=["img1.jpg", "img2.jpg", "img3.jpg"],
    user_prompt="Describe this",
    show_progress=True
)

# ❌ Bad - manual loop
results = []
for img in ["img1.jpg", "img2.jpg", "img3.jpg"]:
    result, error = model(source=img)
    results.append((result, error))

Process Folders Directly

# ✅ Good - direct folder processing
results = model(
    source="./images/",
    recursive=True,
    show_progress=True
)

# ❌ Bad - manual file listing
from pathlib import Path
images = list(Path("./images").glob("*.jpg"))
results = model(source=images)

Optimal Batch Sizes

Balance speed and memory:

def process_in_chunks(model, images, chunk_size=100):
    """Process images in optimal chunks."""
    all_results = []

    for i in range(0, len(images), chunk_size):
        chunk = images[i:i+chunk_size]
        results = model(
            source=chunk,
            user_prompt="Describe this",
            show_progress=True
        )
        all_results.extend(results)

        # Clear cache between chunks
        torch.cuda.empty_cache()

    return all_results

# Recommended chunk sizes by VRAM
# 8GB: chunk_size=50
# 16GB: chunk_size=100
# 24GB+: chunk_size=200
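
The VRAM guideline above can be wrapped in a small helper. This is a rough heuristic that simply mirrors the thresholds in the comments, not an SDK-provided utility:

import torch

def suggest_chunk_size(default=50):
    """Pick a chunk size from detected VRAM, following the rough guideline above."""
    if not torch.cuda.is_available():
        return default
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if vram_gb >= 24:
        return 200
    if vram_gb >= 16:
        return 100
    return 50

results = process_in_chunks(model, images, chunk_size=suggest_chunk_size())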

Model Loading Optimization

Reuse Model Instances

# ✅ Good - create once, reuse
model = ViModel(run_id="your-run-id")
for image in images:
    result, error = model(source=image)

# ❌ Bad - recreate each time
for image in images:
    model = ViModel(run_id="your-run-id")  # Wasteful!
    result, error = model(source=image)

Use Cached Models

# First time - downloads model
model1 = ViModel(run_id="your-run-id")  # Slow

# Subsequent times - uses cache
model2 = ViModel(run_id="your-run-id")  # Fast

Inspect Before Loading

Check model requirements first:

# Inspect without loading
info = ViModel.inspect(run_id="your-run-id")

print(f"Model size: {info.size_gb:.2f} GB")

# Decide loading strategy
if info.size_gb > 10:
    # Use quantization
    model = ViModel(run_id="your-run-id", load_in_8bit=True)
else:
    # Load normally
    model = ViModel(run_id="your-run-id")

Code Optimization Patterns

Efficient Error Handling

import logging

# ✅ Good - check error status
result, error = model(source="image.jpg")
if error is None:
    print(result.caption)
else:
    logging.error(f"Failed: {error}")

# ❌ Bad - try/except for expected flow
try:
    result, error = model(source="image.jpg")
    print(result.caption)
except Exception:
    pass  # Errors are silently swallowed

Minimize Attribute Checks

# ✅ Good - check once
result, error = model(source="image.jpg")
if error is None:
    has_phrases = hasattr(result, 'grounded_phrases')
    if has_phrases:
        for phrase in result.grounded_phrases:
            process(phrase)

# ❌ Bad - check repeatedly
for phrase in result.grounded_phrases if hasattr(result, 'grounded_phrases') else []:
    if hasattr(result, 'grounded_phrases'):  # Redundant!
        process(phrase)

Use List Comprehensions

# ✅ Good - list comprehension
successful = [r for r, e in results if e is None]

# ❌ Bad - manual loop
successful = []
for r, e in results:
    if e is None:
        successful.append(r)

Performance Benchmarking

Measure Inference Time

import time

def benchmark_inference(model, image_path, iterations=10):
    """Benchmark inference performance."""
    times = []

    # Warm-up
    model(source=image_path)

    # Measure
    for _ in range(iterations):
        start = time.perf_counter()
        result, error = model(source=image_path)
        elapsed = time.perf_counter() - start
        times.append(elapsed)

    avg_time = sum(times) / len(times)
    print(f"Average inference time: {avg_time:.3f}s")
    print(f"Throughput: {1/avg_time:.1f} images/sec")

benchmark_inference(model, "test.jpg")

Compare Configurations

def compare_configs():
    """Compare different model configurations."""
    configs = [
        {"name": "Float32", "dtype": "float32"},
        {"name": "Float16", "dtype": "float16"},
        {"name": "8-bit", "load_in_8bit": True},
        {"name": "4-bit", "load_in_4bit": True},
    ]

    for config in configs:
        name = config.pop("name")
        model = ViModel(run_id="your-run-id", **config)

        start = time.perf_counter()
        result, error = model(source="test.jpg")
        elapsed = time.perf_counter() - start

        print(f"{name}: {elapsed:.3f}s")

        # Free the model before loading the next configuration
        del model
        gc.collect()
        torch.cuda.empty_cache()

compare_configs()

Production Optimization

Configuration for Production

model = ViModel(
    secret_key=os.getenv("VI_SECRET_KEY"),
    organization_id=os.getenv("VI_ORG_ID"),
    run_id="your-run-id",
    # Optimize for production
    device_map="auto",
    dtype="float16",
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True,
    load_in_8bit=True  # If memory constrained
)

Disable Progress for Automation

# In production/automated scripts
results = model(
    source=images,
    user_prompt="Describe this",
    show_progress=False  # Reduce overhead
)
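
For scripts that run both interactively and in automation, one option is to key the flag off whether stdout is a terminal. A small sketch (the toggle itself is an assumption, not an SDK feature):

import sys

results = model(
    source=images,
    user_prompt="Describe this",
    show_progress=sys.stdout.isatty()  # Progress bar only when running interactively
)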

Connection Pooling

Reuse model instances across requests:

# FastAPI example
from fastapi import FastAPI

app = FastAPI()

# Load model once at startup
@app.on_event("startup")
async def load_model():
    global model
    model = ViModel(run_id="your-run-id")

@app.post("/predict")
async def predict(image_path: str):
    result, error = model(source=image_path)
    if error is None:
        return {"caption": result.caption}
    return {"error": str(error)}

Best Practices Summary

1. Memory Management

  • Use quantization (8-bit or 4-bit)
  • Enable low_cpu_mem_usage=True
  • Clear GPU cache periodically
  • Process in chunks for large datasets

2. GPU Utilization

  • Use dtype="float16" on modern GPUs
  • Enable Flash Attention 2 if available
  • Use device_map="auto" for multi-GPU
  • Monitor GPU memory usage

3. Batch Processing

  • Use native batch inference
  • Process folders directly
  • Choose appropriate chunk sizes
  • Disable progress in automation

4. Code Efficiency

  • Reuse model instances
  • Use list comprehensions
  • Minimize attribute checks
  • Handle errors efficiently

5. Production Deployment

  • Load model at startup
  • Use environment variables
  • Disable unnecessary features
  • Implement proper monitoring

Hardware Recommendations

Minimum Requirements

  • CPU: 4+ cores
  • RAM: 16GB+
  • GPU: 8GB VRAM (with quantization)
  • Storage: 50GB+ free space

Recommended for Production

  • CPU: 8+ cores
  • RAM: 32GB+
  • GPU: 16GB+ VRAM (NVIDIA A100, RTX 4090, etc.)
  • Storage: 100GB+ SSD

Optimal Configuration

  • CPU: 16+ cores
  • RAM: 64GB+
  • GPU: 24GB+ VRAM or multi-GPU setup
  • Storage: NVMe SSD for model cache
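
As a quick sanity check against the tiers above, the local machine can be inspected with the standard library and PyTorch; the thresholds below simply restate the minimum requirements from this section:

import os
import torch

cpu_cores = os.cpu_count() or 0
print(f"CPU cores: {cpu_cores} (minimum: 4)")

if torch.cuda.is_available():
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU VRAM: {vram_gb:.1f} GB (minimum: 8 GB with quantization)")
else:
    print("No GPU detected - inference will be slow or unavailable")

# System RAM is not checked here; a third-party library such as psutil would be needed for that.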

Related resources