Inference

The Vi SDK lets you load trained vision-language models (VLMs) and run inference on images or videos from Python. You get structured outputs for visual question answering and phrase grounding, with built-in support for batch processing, streaming, and quantized loading. Video uses the same call pattern as images; see Run Inference.

Before you start

The Vi SDK works with models trained on the Datature Vi platform. Follow the quickstart to train your first model, or install the SDK if you already have a trained model.

Get started with the Vi SDK →

Quick start

from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    secret_key="your-secret-key",
    organization_id="your-organization-id"
)

result, error = model(
    source="/path/to/image.jpg",
    user_prompt="What objects are in this image?"
)

if error is None:
    print(f"Result: {result.result}")
Expected output:
Result: A red sedan is parked in front of a brick building. A person stands near the entrance.

For streaming mode with real-time token output:

stream = model(
    source="image.jpg",
    user_prompt="Describe this image",
    stream=True
)

for token in stream:
    print(token, end="", flush=True)

result = stream.get_final_completion()
print(f"\n\nFinal: {result.result}")

What Datature Vi supports

Supported models: Qwen3.5, Qwen3-VL, Qwen2.5-VL, NVILA-Lite, Cosmos-Reason1, Cosmos-Reason2, InternVL3.5

Coming soon: DeepSeek OCR, Gemma 4, LLaVA-NeXT

Task types:

Visual Question Answering (VQA)

Answer questions about images using a trained VLM.

Phrase Grounding

Detect and locate objects with bounding boxes (see the sketch below).

Freeform Text

Get free-form model output in any annotation format specific to your domain.
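
As a quick illustration, phrase grounding uses the same (result, error) call pattern as VQA. A minimal sketch: the file name and prompt are placeholders, and the exact shape of result.result depends on your trained model's output schema.

result, error = model(
    source="street.jpg",
    user_prompt="Locate every car and pedestrian in this image"
)

if error is None:
    print(result.result)  # grounded phrases with bounding boxes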

Learn about task types →

How inference works

Non-streaming (default)

Calling model(...) returns a (result, error) tuple. This pattern makes error handling explicit and works well for batch processing and automated workflows.

result, error = model(source="image.jpg", user_prompt="What's in this image?")

if error is None:
    print(result.result)
else:
    print(f"Error: {error}")
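
Because errors come back as values rather than exceptions, retries are easy to layer on top. A minimal sketch; the retry count and backoff are arbitrary choices here, not SDK features:

import time

def infer_with_retry(model, source, prompt, attempts=3):
    """Retry failed calls with simple exponential backoff."""
    result, error = None, None
    for attempt in range(attempts):
        result, error = model(source=source, user_prompt=prompt)
        if error is None:
            break
        time.sleep(2 ** attempt)  # 1s, 2s, 4s between attempts
    return result, error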

Streaming

Pass stream=True to get token-by-token output as the model generates it. Call stream.get_final_completion() when you need the complete structured result.

Learn more about inference modes →

Model loading

Load from Datature Vi, HuggingFace, or a local path. The SDK caches models locally after the first download.

# From Datature Vi
model = ViModel(run_id="your-run-id")

# From HuggingFace
model = ViModel(pretrained_model_name_or_path="Qwen/Qwen2.5-VL-7B-Instruct")

# With 8-bit quantization (cuts memory ~50%)
model = ViModel(run_id="your-run-id", load_in_8bit=True, device_map="auto")

Learn about model loading →

Batch processing

Pass a list of file paths, a folder path, or a mix of both; the SDK returns one (result, error) tuple per file. Folder sources are expanded to images only, so to batch over videos, pass their paths as an explicit list (see the video sketch after the example below).

results = model(
    source="./images/",
    user_prompt="Describe this image",
    recursive=True,
    show_progress=True
)

for result, error in results:
    if error is None:
        print(result.result)
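
Since folder expansion skips videos, a video batch needs an explicit list of paths. A minimal sketch with placeholder file names:

# Videos are not picked up by folder expansion; list them explicitly.
results = model(
    source=["clip1.mp4", "clip2.mp4"],
    user_prompt="Summarize what happens in this video",
    show_progress=True
)

for result, error in results:
    if error is None:
        print(result.result)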

Learn about batch inference →

Common workflows

Dataset annotation

results = model(
    source="./unlabeled_images/",
    user_prompt="Describe this image concisely",
    recursive=True,
    show_progress=True
)

annotations = [r.result for r, e in results if e is None]
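
To persist the captions, a plain JSON dump works. A sketch: pairing each caption with its source path depends on what your result objects expose, so this saves only the raw list.

import json

# Save the generated captions for later review or import.
with open("annotations.json", "w") as f:
    json.dump(annotations, f, indent=2)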

Quality control

test_cases = [
    {"image": "defect1.jpg", "expected": "yes"},
    {"image": "good1.jpg", "expected": "no"}
]

for test in test_cases:
    result, error = model(
        source=test["image"],
        user_prompt="Does this part have defects? Answer yes or no."
    )
    # Compare the start of the answer; a bare substring check would
    # misfire because "defect" is a substring of "no defect".
    answer = str(result.result).lower().strip() if error is None else ""
    match = answer.startswith(test["expected"])
    print(f"{'PASS' if match else 'FAIL'}: {test['image']}")

Model comparison

models = {
    "v1": ViModel(run_id="run_v1"),
    "v2": ViModel(run_id="run_v2")
}

for name, m in models.items():
    result, error = m(source="test.jpg", user_prompt="Describe this image")
    if error is None:
        print(f"{name}: {result.result}")

Memory and GPU tips

# ~50% memory reduction
model = ViModel(run_id="your-run-id", load_in_8bit=True, device_map="auto")

# ~75% memory reduction
model = ViModel(run_id="your-run-id", load_in_4bit=True, device_map="auto")

# Faster inference on Ampere+ GPUs
model = ViModel(
    run_id="your-run-id",
    attn_implementation="flash_attention_2",
    dtype="float16",
    device_map="auto"
)
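
Before choosing a quantization level, it can help to check how much GPU memory is actually free. A sketch assuming PyTorch with a CUDA device:

import torch

# Report free vs. total memory on the current CUDA device.
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print(f"GPU memory free: {free / 1e9:.1f} / {total / 1e9:.1f} GB")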

Full performance guide →

Next steps

Load Models

Load from Datature Vi or HuggingFace. Configure quantization, device mapping, and caching.

Run Inference

Single file, video, and batch inference, streaming, progress tracking, and error handling.

Task Types

When to use VQA vs phrase grounding, prompt guidelines, and response structures.