Inference

Overview

The Vi SDK provides tools for loading trained vision-language models and running inference on images with structured outputs.

📋 Prerequisites

Get started with Vi SDK →

Key features:

  • Load trained models from Datature Vi or HuggingFace with automatic caching
  • Non-streaming and streaming inference with structured outputs
  • Single-image and batch processing, including entire folders
  • Built-in prediction visualization utilities

ℹ️ Currently supported

Models: Qwen2.5-VL, InternVL 3.5, Cosmos Reason1, NVILA

Coming Soon: DeepSeek OCR, LLaVA-NeXT

Task types: Visual Question Answering (VQA), Phrase Grounding

More models and task types coming in future releases.

Learn about task types →


Quick start

from vi.inference import ViModel

# Load model
model = ViModel(
    run_id="your-run-id",
    secret_key="your-secret-key",
    organization_id="your-organization-id"
)

# Run inference (non-streaming is default)
result, error = model(
    source="/path/to/image.jpg",
    user_prompt="What objects are in this image?"
)

if error is None:
    # Access result fields (see prediction schemas for details)
    print(f"Result: {result.result}")

    # Visualize predictions (optional - works with VQAResponse and PhraseGroundingResponse)
    from vi.inference.utils.visualize import visualize_prediction
    image = visualize_prediction(image_path="/path/to/image.jpg", prediction=result)
    image.save("output.jpg")
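else:
    # Handle the failure case of the (result, error) tuple; printing the error
    # is the minimal option. Log or retry as appropriate for your pipeline.
    print(f"Inference failed: {error}")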

For streaming mode with real-time token generation:

# Use stream=True for streaming mode
stream = model(
    source="image.jpg",
    user_prompt="Describe this image",
    stream=True  # Enable streaming
)

# Iterate through tokens as they're generated
for token in stream:
    print(token, end="", flush=True)

# Get final result
result = stream.get_final_completion()
print(f"\n\nFinal result: {result.caption}")

Installation

Install the SDK with inference dependencies:

pip install "vi-sdk[inference]"

This includes PyTorch, Transformers, and structured output generation tools.
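
To confirm the optional dependencies are importable, a quick check works (a minimal sketch using the same import as the Quick start above):

# Sanity check: these imports should succeed after installation
from vi.inference import ViModel
import torch
import transformers

print(torch.__version__, transformers.__version__)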

Complete installation guide →


Core concepts

Inference modes

Vi SDK supports two inference modes:

  • Non-streaming (default) — Returns (result, error) tuple for explicit error handling
  • Streaming — Real-time token generation, returns Stream object for iteration

Use stream=True to enable streaming mode for real-time token generation.
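
Both modes use the same call; only the return type differs (echoing the Quick start above):

# Non-streaming (default): returns a (result, error) tuple
result, error = model(source="image.jpg", user_prompt="Describe this image")

# Streaming: returns a Stream object to iterate token by token
stream = model(source="image.jpg", user_prompt="Describe this image", stream=True)
for token in stream:
    print(token, end="", flush=True)
result = stream.get_final_completion()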

Model loading

Load models from Datature Vi or HuggingFace with automatic caching and optimization options:

# From Datature Vi
model = ViModel(run_id="your-run-id")

# From HuggingFace
model = ViModel(pretrained_model_name_or_path="Qwen/Qwen2.5-VL-7B-Instruct")

# With optimization
model = ViModel(
    run_id="your-run-id",
    load_in_8bit=True,  # 8-bit quantization
    device_map="auto"    # Auto GPU distribution
)

Learn about loading models →

Running inference

Process single images or batch process folders:

# Single image (non-streaming is default)
result, error = model(source="image.jpg")

# Batch processing
results = model(
    source=["img1.jpg", "img2.jpg", "img3.jpg"],
    user_prompt="Describe this",
    show_progress=True
)

# Process entire folder
results = model(
    source="./images/",
    recursive=True,
    show_progress=True
)
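
Batch and folder calls return one entry per image; each entry is a (result, error) pair, as used in the workflows below:

# Iterate batch results; error is None for successful predictions
for prediction, error in results:
    if error is None:
        print(prediction.result)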

Learn about running inference →

Task types

Support for Visual Question Answering and Phrase Grounding:

# Visual Question Answering (non-streaming is default)
result, error = model(
    source="image.jpg",
    user_prompt="How many people are in this image?"
)

# Phrase Grounding (prompt optional)
result, error = model(
    source="image.jpg",
    user_prompt="Locate all objects"
)
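
Grounding predictions can be drawn onto the source image with the same helper shown in the Quick start (visualize_prediction supports PhraseGroundingResponse):

from vi.inference.utils.visualize import visualize_prediction

# Overlay the grounded predictions on the original image
if error is None:
    annotated = visualize_prediction(image_path="image.jpg", prediction=result)
    annotated.save("grounded.jpg")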

Learn about task types →


Common workflows

Dataset annotation

Generate annotations for unlabeled images using batch inference:

results = model(
    source="./unlabeled_images/",
    user_prompt="Describe this image concisely",
    recursive=True,
    show_progress=True
)

annotations = [r.result for r, e in results if e is None]
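
To persist the generated annotations, a plain JSON dump is enough, assuming the result values are strings or otherwise JSON-serializable:

import json

# Write the collected annotations to disk (file name is illustrative)
with open("annotations.json", "w") as f:
    json.dump(annotations, f, indent=2)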

Quality control

Validate predictions against expected outputs:

test_cases = [
    {"image": "defect1.jpg", "expected": "defect"},
    {"image": "good1.jpg", "expected": "no defect"}
]

for test in test_cases:
    result, error = model(
        source=test["image"],
        user_prompt="Does this have defects?"
    )
    match = test["expected"] in str(result.result).lower() if error is None else False
    print(f"{'✅' if match else '❌'} {test['image']}")

Model comparison

Compare different model versions:

models = {
    "v1": ViModel(run_id="run_v1"),
    "v2": ViModel(run_id="run_v2")
}

for name, model in models.items():
    result, error = model(source="test.jpg")
    if error is None:
        print(f"{name}: {result.result}")

Performance tips

Memory optimization

# Use quantization for large models
model = ViModel(
    run_id="your-run-id",
    load_in_8bit=True,  # ~50% memory reduction
    device_map="auto"
)

# Or 4-bit for maximum compression
model = ViModel(
    run_id="your-run-id",
    load_in_4bit=True,  # ~75% memory reduction
    device_map="auto"
)

GPU utilization

# Enable Flash Attention 2 and mixed precision
model = ViModel(
    run_id="your-run-id",
    attn_implementation="flash_attention_2",
    dtype="float16",
    device_map="auto"
)

Batch processing

# Use native batch inference
results = model(
    source="./images/",  # Process entire folder
    recursive=True,
    show_progress=True
)

# Process in chunks for large datasets
def process_chunks(images, chunk_size=100):
    for i in range(0, len(images), chunk_size):
        chunk = images[i:i+chunk_size]
        yield model(source=chunk)
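
For example, the chunked helper can be driven over a directory of images (the folder path and extension are illustrative):

from pathlib import Path

# Collect image paths to feed through the chunked generator
image_paths = sorted(str(p) for p in Path("./images").glob("*.jpg"))

for chunk_results in process_chunks(image_paths, chunk_size=100):
    for prediction, error in chunk_results:
        if error is None:
            print(prediction.result)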

Complete optimization guide →


Related resources