Optimize Performance
Best practices for optimizing inference performance, memory usage, and throughput with the Vi SDK.
Prerequisites
- A loaded model ready for inference
- Understanding of model loading and inference basics
- GPU hardware (optional but recommended for optimal performance)
- Familiarity with batch processing
Overview
Key optimization strategies for VLM inference:
- Memory optimization — Reduce memory footprint with quantization
- GPU utilization — Maximize compute efficiency with Flash Attention
- Batch processing — Improve throughput with batching strategies
- Model loading — Faster initialization with device mapping
- Code patterns — Efficient implementation with best practices
Memory Optimization
Quantization
Reduce memory usage with minimal quality loss:
8-bit Quantization (~50% reduction)
from vi.inference import ViModel
model = ViModel(
    run_id="your-run-id",
    load_in_8bit=True,
    device_map="auto"
)

Benefits:
- ~50% memory reduction
- Minimal accuracy loss
- Faster inference
- Recommended for most use cases
4-bit Quantization (~75% reduction)
model = ViModel(
    run_id="your-run-id",
    load_in_4bit=True,
    device_map="auto"
)

Benefits:
- ~75% memory reduction
- Slight accuracy tradeoff
- Enables larger models
- Best for limited VRAM
Mixed Precision
Use FP16 for faster inference:
model = ViModel(
    run_id="your-run-id",
    dtype="float16",
    device_map="auto"
)

Benefits:
- 2x faster than FP32
- 50% less memory
- Minimal quality loss
- Supported on modern GPUs
Low CPU Memory Usage
Optimize CPU memory during loading:
model = ViModel(
    run_id="your-run-id",
    low_cpu_mem_usage=True,
    device_map="auto"
)

Clear GPU Cache
Free unused memory periodically:
import torch
import gc
# After batch processing
torch.cuda.empty_cache()
gc.collect()
# In processing loop
for i, image in enumerate(images):
    result, error = model(source=image)
    if i % 100 == 0:
        torch.cuda.empty_cache()
        gc.collect()

GPU Utilization
Check GPU Availability
import torch
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("No GPU available")

Monitor GPU Usage
def print_gpu_utilization():
    """Print current GPU memory usage."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        print(f"GPU Memory - Allocated: {allocated:.2f} GB, Reserved: {reserved:.2f} GB")

# Check before loading
print_gpu_utilization()

# Load model
model = ViModel(run_id="your-run-id")

# Check after loading
print_gpu_utilization()

Flash Attention 2
Enable Flash Attention 2 for faster inference:
model = ViModel(
    run_id="your-run-id",
    attn_implementation="flash_attention_2",
    dtype="float16"
)

Benefits:
- Up to 2-3x faster
- Lower memory usage
- Better scaling for long sequences
- Requires compatible GPU (Ampere or newer); see the capability check below
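If you are not sure whether your GPU qualifies, you can check its CUDA compute capability before enabling Flash Attention 2; Ampere and newer cards report a major version of 8 or higher. The following is a minimal sketch, assuming the Flash Attention package is installed, that falls back to the default attention implementation on older hardware:

import torch
from vi.inference import ViModel

# Ampere (A100, RTX 30xx) and newer GPUs report compute capability >= 8.0
use_flash = (
    torch.cuda.is_available()
    and torch.cuda.get_device_capability(0)[0] >= 8
)

if use_flash:
    model = ViModel(
        run_id="your-run-id",
        attn_implementation="flash_attention_2",
        dtype="float16"
    )
else:
    # Older GPU: load without Flash Attention 2
    model = ViModel(run_id="your-run-id", dtype="float16")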
Multi-GPU Distribution
Automatically distribute across GPUs:
model = ViModel(
    run_id="your-run-id",
    device_map="auto"  # Automatically distributes
)

Batch Processing Optimization
Use Native Batch Inference
# ✅ Good - native batch inference
results = model(
    source=["img1.jpg", "img2.jpg", "img3.jpg"],
    user_prompt="Describe this",
    show_progress=True
)

# ❌ Bad - manual loop
results = []
for img in ["img1.jpg", "img2.jpg", "img3.jpg"]:
    result, error = model(source=img)
    results.append((result, error))

Process Folders Directly
# ✅ Good - direct folder processing
results = model(
    source="./images/",
    recursive=True,
    show_progress=True
)

# ❌ Bad - manual file listing
from pathlib import Path
images = list(Path("./images").glob("*.jpg"))
results = model(source=images)

Optimal Batch Sizes
Balance speed and memory:
def process_in_chunks(model, images, chunk_size=100):
    """Process images in optimal chunks."""
    all_results = []
    for i in range(0, len(images), chunk_size):
        chunk = images[i:i+chunk_size]
        results = model(
            source=chunk,
            user_prompt="Describe this",
            show_progress=True
        )
        all_results.extend(results)
        # Clear cache between chunks
        torch.cuda.empty_cache()
    return all_results

# Recommended chunk sizes by VRAM
# 8GB: chunk_size=50
# 16GB: chunk_size=100
# 24GB+: chunk_size=200

Model Loading Optimization
Reuse Model Instances
# ✅ Good - create once, reuse
model = ViModel(run_id="your-run-id")
for image in images:
    result, error = model(source=image)

# ❌ Bad - recreate each time
for image in images:
    model = ViModel(run_id="your-run-id")  # Wasteful!
    result, error = model(source=image)

Use Cached Models
# First time - downloads model
model1 = ViModel(run_id="your-run-id") # Slow
# Subsequent times - uses cache
model2 = ViModel(run_id="your-run-id")  # Fast

Inspect Before Loading
Check model requirements first:
# Inspect without loading
info = ViModel.inspect(run_id="your-run-id")
print(f"Model size: {info.size_gb:.2f} GB")
# Decide loading strategy
if info.size_gb > 10:
    # Use quantization
    model = ViModel(run_id="your-run-id", load_in_8bit=True)
else:
    # Load normally
    model = ViModel(run_id="your-run-id")

Code Optimization Patterns
Efficient Error Handling
import logging

# ✅ Good - check error status
result, error = model(source="image.jpg")
if error is None:
    print(result.caption)
else:
    logging.error(f"Failed: {error}")

# ❌ Bad - try/except for expected flow
try:
    result = some_function()
except Exception:
    pass

Minimize Attribute Checks
# ✅ Good - check once
result, error = model(source="image.jpg")
if error is None:
    has_phrases = hasattr(result, 'grounded_phrases')
    if has_phrases:
        for phrase in result.grounded_phrases:
            process(phrase)

# ❌ Bad - check repeatedly
for phrase in result.grounded_phrases if hasattr(result, 'grounded_phrases') else []:
    if hasattr(result, 'grounded_phrases'):  # Redundant!
        process(phrase)

Use List Comprehensions
# ✅ Good - list comprehension
successful = [r for r, e in results if e is None]
# ❌ Bad - manual loop
successful = []
for r, e in results:
    if e is None:
        successful.append(r)

Performance Benchmarking
Measure Inference Time
import time
def benchmark_inference(model, image_path, iterations=10):
    """Benchmark inference performance."""
    times = []

    # Warm-up
    model(source=image_path)

    # Measure
    for _ in range(iterations):
        start = time.time()
        result, error = model(source=image_path)
        elapsed = time.time() - start
        times.append(elapsed)

    avg_time = sum(times) / len(times)
    print(f"Average inference time: {avg_time:.3f}s")
    print(f"Throughput: {1/avg_time:.1f} images/sec")

benchmark_inference(model, "test.jpg")

Compare Configurations
def compare_configs():
    """Compare different model configurations."""
    configs = [
        {"name": "Float32", "dtype": "float32"},
        {"name": "Float16", "dtype": "float16"},
        {"name": "8-bit", "load_in_8bit": True},
        {"name": "4-bit", "load_in_4bit": True},
    ]

    for config in configs:
        name = config.pop("name")
        model = ViModel(run_id="your-run-id", **config)

        start = time.time()
        result, error = model(source="test.jpg")
        elapsed = time.time() - start

        print(f"{name}: {elapsed:.3f}s")

compare_configs()

Production Optimization
Configuration for Production
import os

model = ViModel(
    secret_key=os.getenv("VI_SECRET_KEY"),
    organization_id=os.getenv("VI_ORG_ID"),
    run_id="your-run-id",
    # Optimize for production
    device_map="auto",
    dtype="float16",
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True,
    load_in_8bit=True  # If memory constrained
)

Disable Progress for Automation
# In production/automated scripts
results = model(
    source=images,
    user_prompt="Describe this",
    show_progress=False  # Reduce overhead
)

Connection Pooling
Reuse model instances across requests:
# FastAPI example
from fastapi import FastAPI

app = FastAPI()

# Load model once at startup
@app.on_event("startup")
async def load_model():
    global model
    model = ViModel(run_id="your-run-id")

@app.post("/predict")
async def predict(image_path: str):
    result, error = model(source=image_path)
    if error is None:
        return {"caption": result.caption}
    return {"error": str(error)}

Best Practices Summary
1. Memory Management
- Use quantization (8-bit or 4-bit)
- Enable low_cpu_mem_usage=True
- Clear GPU cache periodically
- Process in chunks for large datasets
2. GPU Utilization
- Use dtype="float16" on modern GPUs
- Enable Flash Attention 2 if available
- Use device_map="auto" for multi-GPU
- Monitor GPU memory usage
3. Batch Processing
- Use native batch inference
- Process folders directly
- Choose appropriate chunk sizes
- Disable progress in automation
4. Code Efficiency
- Reuse model instances
- Use list comprehensions
- Minimize attribute checks
- Handle errors efficiently
5. Production Deployment
- Load model at startup
- Use environment variables
- Disable unnecessary features
- Implement proper monitoring
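As a reference point, here is a minimal sketch that pulls these practices together using only the parameters and patterns shown above; the run ID, image folder, and chunk size are placeholders to adapt to your deployment:

import gc
import os
from pathlib import Path

import torch

from vi.inference import ViModel

# Load once at startup: quantized, half precision, automatic device placement
model = ViModel(
    secret_key=os.getenv("VI_SECRET_KEY"),
    organization_id=os.getenv("VI_ORG_ID"),
    run_id="your-run-id",
    device_map="auto",
    dtype="float16",
    low_cpu_mem_usage=True,
    load_in_8bit=True  # If memory constrained
)

def process_images(images, chunk_size=100):
    """Run batch inference in chunks, clearing the GPU cache between chunks."""
    all_results = []
    for i in range(0, len(images), chunk_size):
        results = model(
            source=images[i:i + chunk_size],
            user_prompt="Describe this",
            show_progress=False  # Disabled in automation
        )
        all_results.extend(results)
        torch.cuda.empty_cache()
        gc.collect()
    return all_results

images = [str(p) for p in Path("./images").glob("*.jpg")]
successful = [r for r, e in process_images(images) if e is None]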
Hardware Recommendations
Minimum Requirements
- CPU: 4+ cores
- RAM: 16GB+
- GPU: 8GB VRAM (with quantization)
- Storage: 50GB+ free space
Recommended for Production
- CPU: 8+ cores
- RAM: 32GB+
- GPU: 16GB+ VRAM (NVIDIA A100, RTX 4090, etc.)
- Storage: 100GB+ SSD
Optimal Configuration
- CPU: 16+ cores
- RAM: 64GB+
- GPU: 24GB+ VRAM or multi-GPU setup
- Storage: NVMe SSD for model cache
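To map these tiers onto a loading strategy automatically, the following is a rough sketch; the thresholds mirror the tiers above and are worth tuning for your specific models:

import torch

from vi.inference import ViModel

# Detected VRAM in GB (0 when no GPU is available)
vram_gb = (
    torch.cuda.get_device_properties(0).total_memory / 1e9
    if torch.cuda.is_available()
    else 0
)

if vram_gb >= 24:
    # Optimal tier: half-precision weights, no quantization
    model = ViModel(run_id="your-run-id", dtype="float16", device_map="auto")
elif vram_gb >= 16:
    # Recommended tier: 8-bit quantization (~50% memory reduction)
    model = ViModel(run_id="your-run-id", load_in_8bit=True, device_map="auto")
elif vram_gb >= 8:
    # Minimum tier: 4-bit quantization (~75% memory reduction)
    model = ViModel(run_id="your-run-id", load_in_4bit=True, device_map="auto")
else:
    # No GPU detected: fall back to CPU loading (expect much slower inference)
    model = ViModel(run_id="your-run-id", low_cpu_mem_usage=True)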
Related resources
- Vi SDK inference — Getting started with inference
- Load models — Load models with quantization and device mapping
- Run inference — Execute single and batch predictions efficiently
- Handle results — Process and visualize predictions
- Configure generation — Control output temperature and sampling
- Troubleshoot issues — Common performance problems and solutions
- Download a model — Export models for deployment
- Train a model — Learn about training workflows
- Vi SDK getting started — Quick start guide for the Vi SDK
Need help?
We're here to support your VLMOps journey. Reach out through our support channels.