Run Inference

Run predictions on images using loaded VLM models, in single or batch inference mode.

📋

Prerequisites

Learn how to load models →

Overview

Once you've loaded a model, you can run inference using the model() call. The Vi SDK automatically handles:

  • Single or batch inference — Automatically detected based on input (see the sketch after this list)
  • Non-streaming by default — Returns (result, error) tuple for explicit error handling (use stream=True for real-time token generation)
  • Error handling — Consistent (result, error) tuple pattern in non-streaming mode
  • Progress tracking — Built-in progress bars for batch processing
  • Folder processing — Process entire directories, with optional recursive search through subdirectories
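
For example, the same call pattern covers both cases. A minimal sketch, assuming model is a loaded ViModel (as shown later on this page) and that the image paths are illustrative:

# Single input - returns one (result, error) tuple
result, error = model(
    source="photo.jpg",
    user_prompt="Describe this image"
)

# List input - returns a list of (result, error) tuples
results = model(
    source=["photo1.jpg", "photo2.jpg"],
    user_prompt="Describe this image"
)
for result, error in results:
    if error is None:
        print(result.result)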

Streaming vs non-streaming

📘

Non-streaming is the default mode

When you call model(...) without specifying stream, it defaults to non-streaming mode and returns a (result, error) tuple for explicit error handling. Set stream=True to enable real-time token generation.

Learn about task types →

Streaming mode

Enable streaming mode with stream=True for real-time token generation in visual question answering tasks:

# Streaming mode - set stream=True to enable
stream = model(
    source="image.jpg",
    user_prompt="Describe this image",
    stream=True  # Enable streaming
)

# Iterate through tokens
for token in stream:
    print(token, end="", flush=True)

# Get final result
result = stream.get_final_completion()
# Access result fields based on response type (see prediction schemas)
print(f"\n\n{result.result}")

Use when:

  • Building interactive applications with real-time feedback (see the sketch after this list)
  • Displaying partial output to users as it is generated
  • Visual question answering tasks where immediate feedback matters
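
For example, the interactive case might look like this minimal sketch (it assumes model is a loaded ViModel and that image.jpg exists; the input loop is illustrative, not part of the SDK):

# Interactive VQA loop with streaming output (sketch)
while True:
    question = input("\nAsk about the image (or 'quit' to exit): ")
    if question.strip().lower() == "quit":
        break

    stream = model(
        source="image.jpg",
        user_prompt=question,
        stream=True  # Stream tokens as they are generated
    )
    for token in stream:
        print(token, end="", flush=True)
    print()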

Non-streaming mode (default)

In non-streaming mode, calls return complete results as (result, error) tuples for explicit error handling:

# Non-streaming mode (default) - returns (result, error) tuple
result, error = model(
    source="image.jpg",
    user_prompt="Describe this image"
    # stream=False is the default, no need to specify
)

if error is None:
    # Access result fields based on response type
    # See prediction schemas for all available fields
    print(result.result)  # Generic access
else:
    print(f"Error: {error}")

Use when:

  • Processing batches of images programmatically
  • Building automated pipelines where explicit error handling matters
  • Saving or post-processing complete results rather than displaying output token by token


Single image inference

Basic usage

Run inference on a single image with non-streaming mode (default):

from vi.inference import ViModel

# Load model
model = ViModel(
    run_id="your-run-id",
    secret_key="your-secret-key",
    organization_id="your-organization-id"
)

# Run inference (non-streaming is default)
result, error = model(
    source="/path/to/image.jpg",
    user_prompt="What objects are in this image?"
)

if error is None:
    print(f"Result: {result.result}")
else:
    print(f"Error: {error}")

# For detailed information on accessing result fields, see prediction schemas

For real-time token generation with streaming mode:

# Set stream=True for streaming
stream = model(
    source="image.jpg",
    user_prompt="What objects are in this image?",
    stream=True  # Enable streaming
)

# Iterate through tokens as they're generated
for token in stream:
    print(token, end="", flush=True)

# Get final result
result = stream.get_final_completion()
# Access result fields based on response type (see prediction schemas)
print(f"\n\n{result.result}")

With generation config

Control how output is generated by passing generation parameters:

result, error = model(
    source="image.jpg",
    user_prompt="Describe this image in detail",
    generation_config={
        "max_new_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.9
    }
)

Learn more about generation config →

Different image sources

The Vi SDK accepts various image formats and path styles:

# Local file path
result, error = model(source="./images/photo.jpg")

# Absolute path
result, error = model(source="/home/user/images/photo.jpg")

# User home directory
result, error = model(source="~/Pictures/photo.png")

Supported formats: .jpg, .jpeg, .png, .bmp, .gif, .tiff, .tif, .webp


Batch inference

Process multiple images efficiently with automatic progress tracking and error handling.

💡

Batch inference benefits

  • Automatic progress bars — Track processing status in real-time
  • Individual error handling — Failed images don't stop the batch
  • Memory efficient — Processes images sequentially to manage GPU memory
  • Flexible inputs — Mix files, folders, and paths in a single call

Learn about performance optimization →

Process list of files

# List of image paths
image_paths = ["image1.jpg", "image2.jpg", "image3.jpg"]

# Run batch inference
results = model(
    source=image_paths,
    user_prompt="Describe this image",
    show_progress=True  # Shows progress bar (default)
)

# Process results - each item is a (result, error) tuple
for i, (result, error) in enumerate(results):
    if error is None:
        # Access result fields based on response type
        # See prediction schemas documentation for details
        print(f"Image {i+1}: {result.result}")
    else:
        print(f"Image {i+1} failed: {error}")

Process entire folder

Automatically process all images in a directory with supported formats:

# Process all images in a folder
results = model(
    source="./my_images/",
    user_prompt="Describe this image",
    show_progress=True
)

# Count successes
successful = sum(1 for _, error in results if error is None)
total = len(results)
print(f"Processed {successful}/{total} images successfully")

Recursive directory search

Search and process images in subdirectories recursively:

# Process all images in folder and all subdirectories
results = model(
    source="./dataset/",
    user_prompt="What's in this image?",
    recursive=True,  # Search subdirectories recursively
    show_progress=True
)

print(f"Processed {len(results)} images across all subdirectories")

Mix files and folders

Combine individual files and folders in a single batch inference call:

# Mix files and folders in the same call
results = model(
    source=[
        "./image1.jpg",        # Single file
        "./folder1/",          # All images in folder1
        "~/Pictures/photo.png", # User path
        "./dataset/",          # All images in dataset
    ],
    user_prompt="Analyze this image",
    recursive=False,  # Only immediate folder contents
    show_progress=True
)

Different prompts per image

Provide different user prompts for each image in batch processing:

images = ["car.jpg", "person.jpg", "building.jpg"]
prompts = [
    "What color is the car?",
    "How many people are visible?",
    "What type of building is this?"
]

# Each image gets its own prompt
results = model(
    source=images,
    user_prompt=prompts,
    show_progress=True
)

for image, prompt, (result, error) in zip(images, prompts, results):
    if error is None:
        # Access fields based on response type (see prediction schemas)
        print(f"{image} - {prompt}: {result.result}")
    else:
        print(f"{image} failed: {error}")

⚠️

Prompt Length Matching

When providing a list of prompts, the length must match the number of images:

# ✅ Good - matching lengths
results = model(
    source=["img1.jpg", "img2.jpg"],
    user_prompt=["Prompt 1", "Prompt 2"]
)

# ❌ Bad - length mismatch (raises ValueError)
results = model(
    source=["img1.jpg", "img2.jpg"],
    user_prompt=["Prompt 1"]  # Wrong!
)

Progress tracking

Enable or disable progress bar

Control progress display for batch inference operations:

# With progress bar (default)
results = model(
    source=image_list,
    user_prompt="Describe this",
    show_progress=True  # Default
)

# Without progress bar (silent mode)
results = model(
    source=image_list,
    user_prompt="Describe this",
    show_progress=False
)

Progress information

The progress bar shows:

  • Current image number / total images
  • Processing speed (images per second)
  • Estimated time remaining
  • Real-time success/failure counts

Running batch inference (45 / 100 images)... ━━━━━━╸━━━━━━━━ 45% 0:02:15

Performance tip

For large batch jobs, enable progress tracking to monitor processing speed and identify bottlenecks. Disable it in automated scripts to reduce overhead.

Learn more about optimization →
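
One way to apply this, sketched below, is to enable the progress bar only when the script runs in an interactive terminal (the sys.stdout.isatty() check is a general Python pattern, not part of the Vi SDK):

import sys

# Show the progress bar only when attached to a terminal;
# stay silent when output is piped or redirected to a log
interactive = sys.stdout.isatty()

results = model(
    source="./images/",
    user_prompt="Describe this image",
    show_progress=interactive
)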


Error handling

Consistent error pattern

All inference calls return (result, error) tuples by default (non-streaming mode):

result, error = model(source="image.jpg")

if error is None:
    # Success - process result
    # See prediction schemas for accessing specific fields
    print(f"Success: {result.result}")
else:
    # Error - handle appropriately
    print(f"Failed: {error}")

    # Check error type for specific handling
    if isinstance(error, FileNotFoundError):
        print("Image file not found")
    elif "out of memory" in str(error).lower():
        print("GPU out of memory")

Batch error handling

Each image in a batch has its own error status:

images = ["img1.jpg", "missing.jpg", "img3.jpg"]

results = model(
    source=images,
    user_prompt="Describe this",
    show_progress=True
)

successful = []
failed = []

for img, (result, error) in zip(images, results):
    if error is None:
        successful.append((img, result))
    else:
        failed.append((img, error))

print(f"Successful: {len(successful)}")
print(f"Failed: {len(failed)}")

# Process failures
for img, error in failed:
    print(f"  {img}: {type(error).__name__} - {error}")

Graceful degradation

Continue batch processing even if some images fail:

results = model(
    source="./images/",
    user_prompt="Analyze this",
    recursive=True,
    show_progress=True
)

# Batch inference continues even if individual images fail
valid_results = [r for r, e in results if e is None]
print(f"Successfully processed {len(valid_results)} images")

# Handle failures separately
failures = [(r, e) for r, e in results if e is not None]
for result, error in failures:
    # Log or retry failed images
    print(f"Failed with: {error}")

Common workflows

Save results to JSON

Export inference results to JSON format for later analysis:

import json

from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

# Process folder
results = model(
    source="./images/",
    user_prompt="Describe this image",
    show_progress=True
)

# Save results
output_data = []
for result, error in results:
    output_data.append({
        "result": str(result.result) if error is None else None,
        "error": str(error) if error else None,
        "has_grounding": isinstance(result, PhraseGroundingResponse) if error is None else False
    })

with open("results.json", "w") as f:
    json.dump(output_data, f, indent=2)

print(f"Saved {len(output_data)} results to results.json")

Process with metadata

Track additional information alongside inference results:

from pathlib import Path
from datetime import datetime
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

image_dir = Path("./test_images")
image_files = list(image_dir.glob("*.jpg"))

results = model(
    source=image_files,
    user_prompt="Describe this image",
    show_progress=True
)

# Helper to extract text from any response type
def get_text(result):
    if isinstance(result, VQAResponse):
        return result.result.answer
    elif isinstance(result, PhraseGroundingResponse):
        return result.result.sentence
    else:
        return result.result

# Save with metadata
output = []
for img_path, (result, error) in zip(image_files, results):
    output.append({
        "filename": img_path.name,
        "path": str(img_path),
        "timestamp": datetime.now().isoformat(),
        "text": get_text(result) if error is None else None,
        "error": str(error) if error else None,
        "success": error is None
    })

# Calculate statistics
success_rate = sum(1 for item in output if item["success"]) / len(output)
print(f"Success rate: {success_rate:.1%}")

Retry failed images

Implement retry logic for images that fail during batch processing:

def process_with_retry(model, images, max_retries=3):
    """Process images with retry logic."""
    results = {}
    remaining = list(images)

    for attempt in range(max_retries):
        if not remaining:
            break

        print(f"Attempt {attempt + 1}/{max_retries} - Processing {len(remaining)} images")

        batch_results = model(
            source=remaining,
            user_prompt="Describe this image",
            show_progress=True
        )

        new_remaining = []
        for img, (result, error) in zip(remaining, batch_results):
            if error is None:
                results[img] = result
            else:
                new_remaining.append(img)

        remaining = new_remaining

    return results, remaining

# Usage
successful, failed = process_with_retry(model, image_list)
print(f"Successful: {len(successful)}, Failed: {len(failed)}")

Chunked processing for large datasets

Process large datasets in chunks for better memory management:

from pathlib import Path

import torch

def process_in_chunks(model, image_dir, chunk_size=100):
    """Process images in chunks to manage memory."""
    image_dir = Path(image_dir)
    all_images = list(image_dir.glob("*.jpg"))

    print(f"Processing {len(all_images)} images in chunks of {chunk_size}")

    all_results = []
    for i in range(0, len(all_images), chunk_size):
        chunk = all_images[i:i+chunk_size]
        print(f"\nChunk {i//chunk_size + 1}: Processing {len(chunk)} images...")

        results = model(
            source=chunk,
            user_prompt="Describe this image",
            show_progress=True
        )

        all_results.extend(results)

        # Optional: Clear GPU cache between chunks
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    return all_results

# Usage
results = process_in_chunks(model, "./large_dataset", chunk_size=100)

Best practices

Follow these best practices for efficient inference with the Vi SDK:

Use batch inference

Process multiple images at once with native batch support:

# ✅ Good - native batch inference
results = model(
    source=["img1.jpg", "img2.jpg", "img3.jpg"],
    user_prompt="Describe this",
    show_progress=True
)

# ❌ Bad - manual loop (slower, no progress)
results = []
for image in ["img1.jpg", "img2.jpg", "img3.jpg"]:
    result, error = model(source=image)
    results.append((result, error))

Handle errors gracefully

Always check error status in your code:

# ✅ Good - proper error handling
result, error = model(source="image.jpg")
if error is None:
    print(result.result)
else:
    logging.error(f"Inference failed: {error}")

# ❌ Bad - assuming success
result, _ = model(source="image.jpg")
print(result.result)  # May crash if error occurred

Use progress bars

Enable progress tracking for batch jobs:

# ✅ Good - with progress tracking
results = model(
    source="./images/",
    user_prompt="Describe this",
    show_progress=True  # Default
)

# ❌ Bad - no feedback for long-running jobs
results = model(
    source="./images/",
    user_prompt="Describe this",
    show_progress=False
)

Process folders directly

Use folder paths instead of manual file listing:

# ✅ Good - direct folder processing
results = model(
    source="./images/",
    user_prompt="Describe this",
    recursive=True
)

# ❌ Bad - manual file listing
from pathlib import Path
images = [str(p) for p in Path("./images").rglob("*.jpg")]
results = model(source=images, user_prompt="Describe this")

Reuse model instance

Create model once, reuse many times for better performance:

# ✅ Good - reuse model
model = ViModel(run_id="your-run-id")
for image in images:
    result, error = model(source=image)

# ❌ Bad - recreate model each time
for image in images:
    model = ViModel(run_id="your-run-id")  # Wasteful!
    result, error = model(source=image)

Performance tips

Optimize inference performance with these techniques:

Memory management

Clear GPU cache periodically for long-running batch jobs:

import gc
import torch

for i, image in enumerate(images):
    result, error = model(source=image)

    # Clear cache every 100 images
    if (i + 1) % 100 == 0:
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()

Optimal batch sizes

Balance speed and memory usage for your GPU with chunked processing:

# For GPUs with 8GB VRAM
small_batches = process_in_chunks(model, "./large_dataset", chunk_size=50)

# For GPUs with 16GB+ VRAM
large_batches = process_in_chunks(model, "./large_dataset", chunk_size=200)

Disable progress for scripts

Reduce overhead in automated scripts by disabling progress bars:

# In automated pipelines
results = model(
    source=images,
    user_prompt="Describe this",
    show_progress=False  # Reduce overhead
)

Related resources