Inference
Overview
The Vi SDK provides tools for loading trained vision-language models and running inference on images with structured outputs.
Prerequisites
- Vi SDK installed with inference dependencies
- A trained model or access to HuggingFace models
- Secret key for authentication
- Understanding of task types — VQA or phrase grounding
Key features:
- Load models from Datature Vi or HuggingFace
- Visual question answering and phrase grounding
- Batch processing with automatic progress tracking
- Streaming and non-streaming inference modes
- Memory optimization with quantization
- GPU acceleration and Flash Attention 2 support
Currently supported models: Qwen2.5-VL, InternVL 3.5, Cosmos Reason1, NVILA
Coming Soon: DeepSeek OCR, LLaVA-NeXT
Task types:
- Visual question answering — Answer questions about images
- Phrase grounding — Detect and locate objects with bounding boxes
More models and task types coming in future releases.
Quick start
from vi.inference import ViModel

# Load model
model = ViModel(
    run_id="your-run-id",
    secret_key="your-secret-key",
    organization_id="your-organization-id"
)

# Run inference (non-streaming is default)
result, error = model(
    source="/path/to/image.jpg",
    user_prompt="What objects are in this image?"
)

if error is None:
    # Access result fields (see prediction schemas for details)
    print(f"Result: {result.result}")

# Visualize predictions (optional - works with VQAResponse and PhraseGroundingResponse)
from vi.inference.utils.visualize import visualize_prediction

image = visualize_prediction(image_path="/path/to/image.jpg", prediction=result)
image.save("output.jpg")

For streaming mode with real-time token generation:
# Use stream=True for streaming mode
stream = model(
    source="image.jpg",
    user_prompt="Describe this image",
    stream=True  # Enable streaming
)

# Iterate through tokens as they're generated
for token in stream:
    print(token, end="", flush=True)

# Get final result
result = stream.get_final_completion()
print(f"\n\nFinal result: {result.caption}")

Installation
Install the SDK with inference dependencies:
pip install vi-sdk[inference]

This includes PyTorch, Transformers, and structured output generation tools.
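To confirm the inference extras installed correctly, a quick check like the following can be run. This is a minimal sketch: it only verifies that the imports resolve and reports whether a CUDA device is visible.

import torch
from vi.inference import ViModel  # fails if the inference extras are missing

# Report whether GPU acceleration is available to PyTorch
print(f"CUDA available: {torch.cuda.is_available()}")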
Core concepts
Inference modes
Vi SDK supports two inference modes:
- Non-streaming (default) — Returns a (result, error) tuple for explicit error handling
- Streaming — Real-time token generation, returns a Stream object for iteration
Use stream=True to enable streaming mode for real-time token generation.
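The two modes differ mainly in how output and errors are consumed. A minimal sketch of the difference, assuming a model has already been loaded as in the quick start (paths and prompts are placeholders):

# Non-streaming: result and error come back together as a tuple
result, error = model(source="image.jpg", user_prompt="Describe this image")
if error is not None:
    print(f"Inference failed: {error}")

# Streaming: iterate tokens as they arrive, then retrieve the final result
stream = model(source="image.jpg", user_prompt="Describe this image", stream=True)
for token in stream:
    print(token, end="", flush=True)
result = stream.get_final_completion()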
Model loading
Load models from Datature Vi or HuggingFace with automatic caching and optimization options:
# From Datature Vi
model = ViModel(run_id="your-run-id")

# From HuggingFace
model = ViModel(pretrained_model_name_or_path="Qwen/Qwen2.5-VL-7B-Instruct")

# With optimization
model = ViModel(
    run_id="your-run-id",
    load_in_8bit=True,  # 8-bit quantization
    device_map="auto"   # Auto GPU distribution
)

Running inference
Process single images or batch process folders:
# Single image (non-streaming is default)
result, error = model(source="image.jpg")

# Batch processing
results = model(
    source=["img1.jpg", "img2.jpg", "img3.jpg"],
    user_prompt="Describe this",
    show_progress=True
)

# Process entire folder
results = model(
    source="./images/",
    recursive=True,
    show_progress=True
)

Learn about running inference →
Task types
Support for Visual Question Answering and Phrase Grounding:
# Visual Question Answering (non-streaming is default)
result, error = model(
    source="image.jpg",
    user_prompt="How many people are in this image?"
)

# Phrase Grounding (prompt optional)
result, error = model(
    source="image.jpg",
    user_prompt="Locate all objects"
)

Documentation
- Load models from Datature Vi or HuggingFace with quantization and device management
- Single and batch inference with streaming and progress tracking
- Visual question answering and phrase grounding explained
- Complete reference for response types and available fields for each task type
- Control temperature, max tokens, sampling, and output parameters
- Access results, convert bounding boxes, and visualize predictions
- Memory management, GPU utilization, and performance best practices
- Common issues and solutions for inference problems
Common workflows
Dataset annotation
Generate annotations for unlabeled images using batch inference:
results = model(
    source="./unlabeled_images/",
    user_prompt="Describe this image concisely",
    recursive=True,
    show_progress=True
)
annotations = [r.result for r, e in results if e is None]
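The collected annotations can then be persisted for later use. The sketch below assumes each result field is JSON-serializable (for example, a plain string); the output filename is arbitrary.

import json

# Write the collected annotation texts to a JSON file
with open("annotations.json", "w") as f:
    json.dump(annotations, f, indent=2)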
Quality control
Validate predictions against expected outputs:
test_cases = [
    {"image": "defect1.jpg", "expected": "defect"},
    {"image": "good1.jpg", "expected": "no defect"}
]

for test in test_cases:
    result, error = model(
        source=test["image"],
        user_prompt="Does this have defects?"
    )
    match = test["expected"] in str(result.result).lower() if error is None else False
    print(f"{'✅' if match else '❌'} {test['image']}")

Model comparison
Compare different model versions:
models = {
    "v1": ViModel(run_id="run_v1"),
    "v2": ViModel(run_id="run_v2")
}

for name, model in models.items():
    result, error = model(source="test.jpg")
    if error is None:
        print(f"{name}: {result.result}")

Performance tips
Memory optimization
# Use quantization for large models
model = ViModel(
    run_id="your-run-id",
    load_in_8bit=True,  # ~50% memory reduction
    device_map="auto"
)

# Or 4-bit for maximum compression
model = ViModel(
    run_id="your-run-id",
    load_in_4bit=True,  # ~75% memory reduction
    device_map="auto"
)
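To see how much accelerator memory the loaded model actually occupies, standard PyTorch utilities can be used after loading. A minimal sketch, assuming a CUDA device is present:

import torch

# Report GPU memory currently allocated by the loaded model (CUDA only)
if torch.cuda.is_available():
    allocated_gb = torch.cuda.memory_allocated() / 1024**3
    print(f"GPU memory allocated: {allocated_gb:.2f} GB")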
GPU utilization

# Enable Flash Attention 2 and mixed precision
model = ViModel(
    run_id="your-run-id",
    attn_implementation="flash_attention_2",
    dtype="float16",
    device_map="auto"
)
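Flash Attention 2 requires the separate flash-attn package to be installed. A quick availability check, sketched here with only the standard library, can catch a missing dependency before loading:

import importlib.util

# flash-attn must be installed for attn_implementation="flash_attention_2" to work
has_flash_attn = importlib.util.find_spec("flash_attn") is not None
print(f"Flash Attention 2 available: {has_flash_attn}")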
Batch processing

# Use native batch inference
results = model(
    source="./images/",  # Process entire folder
    recursive=True,
    show_progress=True
)

# Process in chunks for large datasets
def process_chunks(images, chunk_size=100):
    for i in range(0, len(images), chunk_size):
        chunk = images[i:i+chunk_size]
        yield model(source=chunk)
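The chunked generator above can then be consumed incrementally so that only one chunk of results is held at a time. A sketch, where image_paths is a placeholder list of file paths:

# Iterate chunk by chunk and keep only successful results
image_paths = [f"img_{i}.jpg" for i in range(1000)]  # placeholder paths
successful = []
for chunk_results in process_chunks(image_paths):
    successful.extend(r for r, e in chunk_results if e is None)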
Related resources
- Vi SDK getting started — Quick start guide for the Vi SDK
- Load models — Load models from Datature Vi or HuggingFace
- Run inference — Single and batch inference guide
- Handle results — Process and visualize predictions
- Task types — VQA and phrase grounding explained
- Prediction schemas — Complete reference for all response types and available fields
- Optimize performance — Memory management and GPU utilization
- Troubleshoot issues — Common problems and solutions
- API resources — Complete SDK reference documentation
- Vi SDK installation — Installation instructions and requirements
- Deploy and test — End-to-end deployment workflow
- Train a model — Training guide for VLMs
- Evaluate a model — Assess model performance
Need help?
We're here to support your VLMOps journey. Reach out through any of our support channels.