Improve Performance
Datature Vi inference performance depends on three factors: how the model loads, how you run predictions, and how you manage GPU memory. This page covers the key levers for each.
Prerequisites

- A loaded model ready for inference
- Understanding of model loading options and batch inference
- GPU hardware (optional but strongly recommended)
Memory management
Quantization
Quantization reduces model weight precision to cut memory usage. The tradeoff is a small accuracy reduction, which is acceptable for most production use cases.
8-bit quantization (~50% memory reduction)
from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    load_in_8bit=True,
    device_map="auto"
)

Cuts memory by about half with minimal accuracy impact. Start here when your GPU has limited VRAM.
4-bit quantization (~75% memory reduction)
model = ViModel(
    run_id="your-run-id",
    load_in_4bit=True,
    device_map="auto"
)

Cuts memory by about three-quarters. There is a slight accuracy tradeoff, but this lets you run larger models on smaller GPUs.
Mixed precision
FP16 runs about 2x faster than FP32 and uses 50% less memory, with negligible quality change on modern GPUs.
model = ViModel(
    run_id="your-run-id",
    dtype="float16",
    device_map="auto"
)

Low CPU memory usage
Reduces peak CPU memory during model loading. Worth enabling when loading large models, especially with quantization.
model = ViModel(
    run_id="your-run-id",
    low_cpu_mem_usage=True,
    device_map="auto"
)

Clear GPU cache
For long-running batch jobs, clearing the GPU cache periodically prevents fragmentation from accumulating.
import torch
import gc

for i, image in enumerate(images):
    result, error = model(source=image)
    if i % 100 == 0:
        torch.cuda.empty_cache()
        gc.collect()

GPU utilization
Check GPU availability
import torch

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA: {torch.version.cuda}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("No GPU available, using CPU")

Monitor GPU memory
def print_gpu_utilization():
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        print(f"GPU Memory: Allocated: {allocated:.2f} GB, Reserved: {reserved:.2f} GB")

print_gpu_utilization()  # Before loading
model = ViModel(run_id="your-run-id")
print_gpu_utilization()  # After loading

Flash Attention 2
Flash Attention 2 speeds up inference by 2-3x on compatible GPUs and reduces memory for long sequences. It requires the flash-attn package and an Ampere-or-newer GPU (RTX 30xx, A100, H100, etc.).
model = ViModel(
    run_id="your-run-id",
    attn_implementation="flash_attention_2",
    dtype="float16"
)
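If the same code runs on mixed hardware, you can enable Flash Attention 2 only when the package is actually installed. A minimal sketch, assuming the package's import name is flash_attn and that omitting attn_implementation falls back to the default attention:

import importlib.util

# Use Flash Attention 2 only if the flash-attn package is available;
# otherwise let the model fall back to the default attention implementation.
use_flash = importlib.util.find_spec("flash_attn") is not None

kwargs = {"attn_implementation": "flash_attention_2"} if use_flash else {}
model = ViModel(
    run_id="your-run-id",
    dtype="float16",
    **kwargs
)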
device_map="auto" distributes the model across all available GPUs automatically, which is the right default for multi-GPU servers.
model = ViModel(
    run_id="your-run-id",
    device_map="auto"
)

Batch processing strategies
Use native batch inference
The SDK's native batch support is faster than running a loop of single-image calls. It handles progress tracking and individual error isolation.
# Good: native batch inference
results = model(
    source=["img1.jpg", "img2.jpg", "img3.jpg"],
    user_prompt="Describe this",
    show_progress=True
)

# Avoid: manual loop (slower, no progress tracking)
results = []
for img in ["img1.jpg", "img2.jpg", "img3.jpg"]:
    result, error = model(source=img)
    results.append((result, error))

Process folders directly
Folder paths avoid the overhead of building file lists manually.
# Good: pass folder path directly
results = model(source="./images/", recursive=True)

# Avoid: manual file listing
from pathlib import Path
images = list(Path("./images").glob("*.jpg"))
results = model(source=images)

Chunk large datasets
Process large datasets in chunks to keep GPU memory under control. Clear the cache between chunks.
import torch
from pathlib import Path

def process_in_chunks(model, image_dir, chunk_size=100):
    all_images = list(Path(image_dir).glob("*.jpg"))
    print(f"Processing {len(all_images)} images in chunks of {chunk_size}")
    all_results = []
    for i in range(0, len(all_images), chunk_size):
        chunk = all_images[i:i + chunk_size]
        results = model(
            source=chunk,
            user_prompt="Describe this",
            show_progress=True
        )
        all_results.extend(results)
        torch.cuda.empty_cache()
    return all_results

# Recommended chunk sizes by VRAM:
# 8 GB VRAM → chunk_size=50
# 16 GB VRAM → chunk_size=100
# 24 GB+ VRAM → chunk_size=200
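If the same chunking job runs on machines with different GPUs, you can derive a starting chunk size from the detected VRAM instead of hard-coding it. A minimal sketch based on the tiers above; the pick_chunk_size helper and its thresholds are illustrative, not part of the SDK:

import torch

def pick_chunk_size(default=50):
    # Illustrative heuristic: map detected VRAM to the tiers listed above.
    if not torch.cuda.is_available():
        return default
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if vram_gb >= 24:
        return 200
    if vram_gb >= 16:
        return 100
    return 50

results = process_in_chunks(model, "./images", chunk_size=pick_chunk_size())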
Model loading strategies

Reuse model instances
Creating a ViModel loads weights from disk into GPU memory. Do this once at startup, not inside a loop.
# Good: create once, reuse
model = ViModel(run_id="your-run-id")
for image in images:
    result, error = model(source=image)

# Avoid: recreating on each iteration
for image in images:
    model = ViModel(run_id="your-run-id")  # loads weights every time
    result, error = model(source=image)

Inspect before loading
Check model size before committing to a loading strategy. This reads only metadata files and requires no GPU memory.
info = ViModel.inspect(run_id="your-run-id")
print(f"Model size: {info.size_gb:.2f} GB")

if info.size_gb > 10:
    model = ViModel(run_id="your-run-id", load_in_8bit=True)
else:
    model = ViModel(run_id="your-run-id")

Use the model cache
After the first download, the SDK caches the model locally at ~/.datature/vi/models/. Subsequent loads are fast. Caching is on by default; you don't need to configure anything.
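To see what is already cached, or how much disk the cache uses, you can inspect that directory directly. A minimal sketch, assuming the default cache location above; this uses only the filesystem, not the SDK:

from pathlib import Path

cache_dir = Path.home() / ".datature" / "vi" / "models"

if cache_dir.exists():
    # Sum file sizes under the cache directory to see how much disk it uses.
    size_gb = sum(f.stat().st_size for f in cache_dir.rglob("*") if f.is_file()) / 1e9
    print(f"Cache: {cache_dir} ({size_gb:.2f} GB)")
else:
    print("No cached models yet; the first load will download the weights.")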
Code improvement patterns
Efficient error handling
# Good: check error status
result, error = model(source="image.jpg")
if error is None:
    print(result.result)
else:
    print(f"Failed: {error}")

# Avoid: using try/except for normal control flow
try:
    result = some_function()
except Exception:
    pass

Minimize attribute checks
Check for optional attributes once, store the result, and reuse it rather than repeating hasattr on every access.

result, error = model(source="image.jpg")
if error is None:
    has_groundings = hasattr(result, "result") and hasattr(result.result, "groundings")
    if has_groundings:
        for grounding in result.result.groundings:
            process(grounding)

Use list comprehensions
# Good: list comprehension
successful = [r for r, e in results if e is None]

# Avoid: manual append loop
successful = []
for r, e in results:
    if e is None:
        successful.append(r)

Performance benchmarking
Measure inference time
import time

def benchmark_inference(model, image_path, iterations=10):
    times = []
    model(source=image_path)  # warm-up
    for _ in range(iterations):
        start = time.time()
        model(source=image_path)
        times.append(time.time() - start)
    avg_time = sum(times) / len(times)
    print(f"Average: {avg_time:.3f}s")
    print(f"Throughput: {1 / avg_time:.1f} images/sec")

benchmark_inference(model, "test.jpg")

Compare configurations
import time
import torch

configs = [
    {"name": "Float32", "dtype": "float32"},
    {"name": "Float16", "dtype": "float16"},
    {"name": "8-bit", "load_in_8bit": True},
    {"name": "4-bit", "load_in_4bit": True},
]

for config in configs:
    name = config.pop("name")
    m = ViModel(run_id="your-run-id", **config)
    start = time.time()
    m(source="test.jpg")
    elapsed = time.time() - start
    print(f"{name}: {elapsed:.3f}s")
    # Free the current model before loading the next configuration
    del m
    torch.cuda.empty_cache()

Production configuration
Recommended production configuration
import os

model = ViModel(
    secret_key=os.getenv("VI_SECRET_KEY"),
    organization_id=os.getenv("VI_ORG_ID"),
    run_id="your-run-id",
    device_map="auto",
    dtype="float16",
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True,
    load_in_8bit=True  # if memory is constrained
)

Disable progress display in automated pipelines
results = model(
    source=images,
    user_prompt="Describe this",
    show_progress=False
)

Load model once at application startup
from fastapi import FastAPI
from vi.inference import ViModel

app = FastAPI()
model = None

@app.on_event("startup")
async def load_model():
    global model
    model = ViModel(run_id="your-run-id")

@app.post("/predict")
async def predict(image_path: str):
    result, error = model(source=image_path)
    if error is None:
        return {"result": result.result}
    return {"error": str(error)}

Hardware recommendations
If your hardware budget allows, use NVMe storage: it reduces model cache load time on first inference.