Inference
Overview
The Vi SDK provides tools for loading trained vision-language models and running inference on images with structured outputs.
Prerequisites
- Vi SDK installed with inference dependencies
- A trained model or access to HuggingFace models
- Secret key for authentication
- Understanding of task types — VQA or phrase grounding
Key features:
- Load models from Datature Vi or HuggingFace
- Visual question answering and phrase grounding
- Batch processing with automatic progress tracking
- Streaming and non-streaming inference modes
- Memory optimization with quantization
- GPU acceleration and Flash Attention 2 support
Currently supported models: Qwen2.5-VL, InternVL 3.5, Cosmos Reason1, NVILA
Coming Soon: DeepSeek OCR, LLaVA-NeXT
Task types:
- Visual question answering — Answer questions about images
- Phrase grounding — Detect and locate objects with bounding boxes
More models and task types coming in future releases.
Quick start
from vi.inference import ViModel

# Load model
model = ViModel(
    run_id="your-run-id",
    secret_key="your-secret-key",
    organization_id="your-organization-id"
)

# Run inference (non-streaming is default)
result, error = model(
    source="/path/to/image.jpg",
    user_prompt="What objects are in this image?"
)

if error is None:
    # Access result fields (see prediction schemas for details)
    print(f"Result: {result.result}")

# Visualize predictions (optional - works with VQAResponse and PhraseGroundingResponse)
from vi.inference.utils.visualize import visualize_prediction

image = visualize_prediction(image_path="/path/to/image.jpg", prediction=result)
image.save("output.jpg")

For streaming mode with real-time token generation:
# Use stream=True for streaming mode
stream = model(
    source="image.jpg",
    user_prompt="Describe this image",
    stream=True  # Enable streaming
)

# Iterate through tokens as they're generated
for token in stream:
    print(token, end="", flush=True)

# Get final result
result = stream.get_final_completion()
print(f"\n\nFinal result: {result.caption}")

Installation
Install the SDK with inference dependencies:
pip install vi-sdk[inference]

This includes PyTorch, Transformers, and structured output generation tools.
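To confirm the inference extras installed correctly, a quick check like the following can be run. This is a minimal sketch: it only verifies that the imports resolve and reports whether a CUDA device is visible.

import torch
from vi.inference import ViModel  # fails if the inference extras are missing

# Report whether GPU acceleration is available to PyTorch
print(f"CUDA available: {torch.cuda.is_available()}")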
Core concepts
Inference modes
Vi SDK supports two inference modes:
- Non-streaming (default) — Returns a (result, error) tuple for explicit error handling
- Streaming — Real-time token generation, returns a Stream object for iteration
Use stream=True to enable streaming mode for real-time token generation.
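The two modes differ mainly in how output and errors are consumed. A minimal sketch of the difference, assuming a model has already been loaded as in the quick start (paths and prompts are placeholders):

# Non-streaming: result and error come back together as a tuple
result, error = model(source="image.jpg", user_prompt="Describe this image")
if error is not None:
    print(f"Inference failed: {error}")

# Streaming: iterate tokens as they arrive, then retrieve the final result
stream = model(source="image.jpg", user_prompt="Describe this image", stream=True)
for token in stream:
    print(token, end="", flush=True)
result = stream.get_final_completion()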
Model loading
Load models from Datature Vi or HuggingFace with automatic caching and optimization options:
# From Datature Vi
model = ViModel(run_id="your-run-id")

# From HuggingFace
model = ViModel(pretrained_model_name_or_path="Qwen/Qwen2.5-VL-7B-Instruct")

# With optimization
model = ViModel(
    run_id="your-run-id",
    load_in_8bit=True,  # 8-bit quantization
    device_map="auto"   # Auto GPU distribution
)

Running inference
Process single images or batch process folders:
# Single image (non-streaming is default)
result, error = model(source="image.jpg")

# Batch processing
results = model(
    source=["img1.jpg", "img2.jpg", "img3.jpg"],
    user_prompt="Describe this",
    show_progress=True
)

# Process entire folder
results = model(
    source="./images/",
    recursive=True,
    show_progress=True
)

Learn about running inference →
Task types
Support for Visual Question Answering and Phrase Grounding:
# Visual Question Answering (non-streaming is default)
result, error = model(
    source="image.jpg",
    user_prompt="How many people are in this image?"
)

# Phrase Grounding (prompt optional)
result, error = model(
    source="image.jpg",
    user_prompt="Locate all objects"
)

Documentation
- Load models from Datature Vi or HuggingFace with quantization and device management
- Single and batch inference with streaming and progress tracking
- Visual question answering and phrase grounding explained
- Complete reference for response types and available fields for each task type
- Control temperature, max tokens, sampling, and output parameters
- Access results, convert bounding boxes, and visualize predictions
- Memory management, GPU utilization, and performance best practices
- Common issues and solutions for inference problems
Common workflows
Dataset annotation
Generate annotations for unlabeled images using batch inference:
results = model(
    source="./unlabeled_images/",
    user_prompt="Describe this image concisely",
    recursive=True,
    show_progress=True
)
annotations = [r.result for r, e in results if e is None]
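The collected annotations can then be persisted for later use. The sketch below assumes each result field is JSON-serializable (for example, a plain string); the output filename is arbitrary.

import json

# Write the collected annotation texts to a JSON file
with open("annotations.json", "w") as f:
    json.dump(annotations, f, indent=2)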
Quality control
Validate predictions against expected outputs:
test_cases = [
    {"image": "defect1.jpg", "expected": "defect"},
    {"image": "good1.jpg", "expected": "no defect"}
]

for test in test_cases:
    result, error = model(
        source=test["image"],
        user_prompt="Does this have defects?"
    )
    match = test["expected"] in str(result.result).lower() if error is None else False
    print(f"{'✅' if match else '❌'} {test['image']}")

Model comparison
Compare different model versions:
models = {
    "v1": ViModel(run_id="run_v1"),
    "v2": ViModel(run_id="run_v2")
}

for name, model in models.items():
    result, error = model(source="test.jpg")
    if error is None:
        print(f"{name}: {result.result}")

Performance tips
Memory optimization
# Use quantization for large models
model = ViModel(
    run_id="your-run-id",
    load_in_8bit=True,  # ~50% memory reduction
    device_map="auto"
)

# Or 4-bit for maximum compression
model = ViModel(
    run_id="your-run-id",
    load_in_4bit=True,  # ~75% memory reduction
    device_map="auto"
)
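To see how much accelerator memory the loaded model actually occupies, standard PyTorch utilities can be used after loading. A minimal sketch, assuming a CUDA device is present:

import torch

# Report GPU memory currently allocated by the loaded model (CUDA only)
if torch.cuda.is_available():
    allocated_gb = torch.cuda.memory_allocated() / 1024**3
    print(f"GPU memory allocated: {allocated_gb:.2f} GB")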
GPU utilization

# Enable Flash Attention 2 and mixed precision
model = ViModel(
    run_id="your-run-id",
    attn_implementation="flash_attention_2",
    dtype="float16",
    device_map="auto"
)
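Flash Attention 2 requires the separate flash-attn package to be installed. A quick availability check, sketched here with only the standard library, can catch a missing dependency before loading:

import importlib.util

# flash-attn must be installed for attn_implementation="flash_attention_2" to work
has_flash_attn = importlib.util.find_spec("flash_attn") is not None
print(f"Flash Attention 2 available: {has_flash_attn}")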
Batch processing

# Use native batch inference
results = model(
    source="./images/",  # Process entire folder
    recursive=True,
    show_progress=True
)

# Process in chunks for large datasets
def process_chunks(images, chunk_size=100):
    for i in range(0, len(images), chunk_size):
        chunk = images[i:i+chunk_size]
        yield model(source=chunk)
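The chunked generator above can then be consumed incrementally so that only one chunk of results is held at a time. A sketch, where image_paths is a placeholder list of file paths:

# Iterate chunk by chunk and keep only successful results
image_paths = [f"img_{i}.jpg" for i in range(1000)]  # placeholder paths
successful = []
for chunk_results in process_chunks(image_paths):
    successful.extend(r for r, e in chunk_results if e is None)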
Related resources
- Vi SDK getting started — Quick start guide for the Vi SDK
- Load models — Load models from Datature Vi or HuggingFace
- Run inference — Single and batch inference guide
- Handle results — Process and visualize predictions
- Task types — VQA and phrase grounding explained
- Prediction schemas — Complete reference for all response types and available fields
- Optimize performance — Memory management and GPU utilization
- Troubleshoot issues — Common problems and solutions
- API resources — Complete SDK reference documentation
- Vi SDK installation — Installation instructions and requirements
- Deploy and test — End-to-end deployment workflow
- Train a model — Training guide for VLMs
- Evaluate a model — Assess model performance
Need help?
We're here to support your VLMOps journey. Reach out through any of our support channels.