Prediction result schemas

Reference documentation for all prediction result schemas returned by different task types in the Vi SDK.

📋 Understanding result schemas

Each task type returns a specific response structure:

  • Visual Question Answering → VQAResponse
  • Phrase Grounding → PhraseGroundingResponse
  • Generic (fallback) → GenericResponse

All responses include common fields like prompt, raw_output, and thinking.

Learn about task types →

Overview

Prediction responses inherit from a base PredictionResponse class and add task-specific fields. Understanding these schemas helps you:

  • Access the right fields — Know which properties are available for each task type
  • Handle different task types — Write robust code that works with all response types
  • Debug issues — Inspect raw_output and thinking fields when needed
  • Parse results correctly — Extract structured data from predictions

Base response fields

All prediction responses include these base fields from PredictionResponse:

Common fields

Field        Type         Description
prompt       str          The user prompt used for this prediction
raw_output   str | None   Raw model output string before parsing (includes <think> and <answer> tags if COT enabled)
thinking     str | None   Extracted content from <think>...</think> tags (available only when chain-of-thought is enabled)

Example: Accessing base fields

from vi.inference import ViModel

model = ViModel(run_id="your-run-id")
result, error = model(source="image.jpg", user_prompt="Describe this image")

if error is None:
    # Access base fields (available on all response types)
    print(f"Prompt: {result.prompt}")
    print(f"Raw output length: {len(result.raw_output) if result.raw_output else 0}")
    
    # Check for chain-of-thought reasoning
    if result.thinking:
        print(f"Model's reasoning: {result.thinking}")

VQA response

Returned when using the Visual Question Answering task type. Contains a text answer to the user's question.

Schema structure

class VQAResponse(PredictionResponse):
    """VQA response object.
    
    Attributes:
        prompt: The user prompt used for inference
        raw_output: Raw model output string (includes tags if COT enabled)
        thinking: Extracted <think> content (None if not present)
        result: VQAAnswer object containing the answer
    """
    result: VQAAnswer

class VQAAnswer:
    """VQA answer object.
    
    Attributes:
        answer: The answer text (minimum 1 character)
    """
    answer: str

Available fields

Field           Type        Description
result          VQAAnswer   Container object for the answer
result.answer   str         The actual answer text to the question

Example usage

from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

# VQA inference
result, error = model(
    source="image.jpg",
    user_prompt="What color is the car in this image?"
)

if error is None:
    # Access VQA answer
    answer = result.result.answer
    print(f"Answer: {answer}")
    
    # Access base fields
    print(f"Question: {result.prompt}")
    if result.thinking:
        print(f"Reasoning: {result.thinking}")

Streaming mode

# VQA with streaming
stream = model(
    source="image.jpg",
    user_prompt="Describe this image in detail",
    stream=True
)

# Stream tokens
for token in stream:
    print(token, end="", flush=True)

# Get final result
result = stream.get_final_completion()
print(f"\n\nFinal answer: {result.result.answer}")

Phrase grounding response

Returned when using the Phrase Grounding task type. Contains a caption with bounding boxes for detected objects.

Schema structure

class PhraseGroundingResponse(PredictionResponse):
    """Phrase grounding response object.
    
    Attributes:
        prompt: The user prompt used for inference
        raw_output: Raw model output string (includes tags if COT enabled)
        thinking: Extracted <think> content (None if not present)
        result: PhraseGrounding object containing sentence and groundings
    """
    result: PhraseGrounding

class PhraseGrounding:
    """Phrase grounding object.
    
    Attributes:
        sentence: Full caption/sentence text (minimum 1 character)
        groundings: List of grounded phrases with bounding boxes (minimum 1)
    """
    sentence: str
    groundings: list[GroundedPhrase]

class GroundedPhrase:
    """Text phrase with associated bounding box.
    
    Attributes:
        phrase: The text phrase (minimum 1 character)
        grounding: List of bounding boxes [xmin, ymin, xmax, ymax] in range [0, 1024]
    """
    phrase: str
    grounding: list[list[int]]  # Each box: [xmin, ymin, xmax, ymax]

Available fields

Field                            Type                   Description
result                           PhraseGrounding        Container object for phrase grounding results
result.sentence                  str                    The full caption/description text
result.groundings                list[GroundedPhrase]   List of detected objects with bounding boxes
result.groundings[i].phrase      str                    Text label for the i-th detected object
result.groundings[i].grounding   list[list[int]]        List of bounding boxes for the i-th object

Bounding box format

📐 Coordinate system

Bounding boxes use normalized coordinates in range [0, 1024]:

  • Format: [x_min, y_min, x_max, y_max]
  • Top-left corner: (0, 0)
  • Bottom-right corner: (1024, 1024)
  • Independent of actual image dimensions

Convert to pixel coordinates for visualization:

from PIL import Image

image = Image.open("image.jpg")
width, height = image.size

# bbox is one entry from a grounding, e.g. result.result.groundings[0].grounding[0]
x_min, y_min, x_max, y_max = bbox  # values in the [0, 1024] range
pixel_x_min = int(x_min / 1024 * width)
pixel_y_min = int(y_min / 1024 * height)
pixel_x_max = int(x_max / 1024 * width)
pixel_y_max = int(y_max / 1024 * height)

Tip: Use the built-in visualize_prediction() utility for automatic coordinate conversion and visualization.
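
For example, on a 1920×1080 image a normalized x_min of 512 maps to int(512 / 1024 * 1920) = 960 pixels. If you prefer to draw boxes manually rather than use visualize_prediction(), the sketch below uses Pillow's ImageDraw (draw_groundings is a hypothetical helper, not an SDK utility; it assumes groundings taken from result.result.groundings):

from PIL import Image, ImageDraw

def draw_groundings(image_path, groundings, output_path="annotated.jpg"):
    """Draw phrase grounding boxes and labels on an image."""
    image = Image.open(image_path)
    width, height = image.size
    draw = ImageDraw.Draw(image)
    
    for grounded_phrase in groundings:
        for bbox in grounded_phrase.grounding:
            x_min, y_min, x_max, y_max = bbox
            # Convert from the normalized [0, 1024] range to pixel coordinates
            box = (
                x_min / 1024 * width,
                y_min / 1024 * height,
                x_max / 1024 * width,
                y_max / 1024 * height,
            )
            draw.rectangle(box, outline="red", width=3)
            # Label the box with its phrase
            draw.text((box[0], box[1]), grounded_phrase.phrase, fill="red")
    
    image.save(output_path)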

Example usage

from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

# Phrase grounding inference
result, error = model(
    source="image.jpg",
    user_prompt="Describe the objects in this image"
)

if error is None:
    # Access the caption
    caption = result.result.sentence
    print(f"Caption: {caption}")
    
    # Access grounded phrases (detected objects)
    for grounded_phrase in result.result.groundings:
        phrase = grounded_phrase.phrase
        bboxes = grounded_phrase.grounding
        
        print(f"\nObject: {phrase}")
        print(f"  Number of bounding boxes: {len(bboxes)}")
        
        for i, bbox in enumerate(bboxes):
            x_min, y_min, x_max, y_max = bbox
            print(f"  Box {i+1}: [{x_min}, {y_min}, {x_max}, {y_max}]")

Converting to pixel coordinates

from PIL import Image

def convert_bbox_to_pixels(bbox, image_path):
    """Convert normalized bbox [0-1024] to pixel coordinates."""
    image = Image.open(image_path)
    width, height = image.size
    
    x_min, y_min, x_max, y_max = bbox
    
    return [
        int(x_min / 1024 * width),
        int(y_min / 1024 * height),
        int(x_max / 1024 * width),
        int(y_max / 1024 * height)
    ]

# Usage with phrase grounding results
result, error = model(source="image.jpg")

if error is None:
    for grounded_phrase in result.result.groundings:
        print(f"Object: {grounded_phrase.phrase}")
        
        for bbox in grounded_phrase.grounding:
            pixel_bbox = convert_bbox_to_pixels(bbox, "image.jpg")
            print(f"  Pixel coordinates: {pixel_bbox}")

Filtering groundings

# Filter groundings by object type
result, error = model(source="image.jpg")

if error is None:
    # Find all people
    people = [
        g for g in result.result.groundings 
        if "person" in g.phrase.lower()
    ]
    
    # Find all vehicles
    vehicles = [
        g for g in result.result.groundings 
        if any(v in g.phrase.lower() for v in ["car", "truck", "vehicle"])
    ]
    
    print(f"Found {len(people)} people and {len(vehicles)} vehicles")

Generic response

Returned as a fallback when:

  • Task type is explicitly set to GENERIC
  • JSON parsing fails for structured task types
  • Model output cannot be parsed into expected format

Contains the raw text output without structured parsing.

Schema structure

class GenericResponse(PredictionResponse):
    """Generic response object.
    
    Used for generic task type or when parsing fails.
    Result contains the full raw output (includes thinking and answer content if COT enabled).
    
    Attributes:
        prompt: The user prompt used for inference
        raw_output: Raw model output string (includes tags if COT enabled)
        thinking: Extracted <think> content (None if not present)
        result: The raw output string
    """
    result: str

Available fields

Field    Type   Description
result   str    The complete raw output text from the model

Example usage

from vi.inference import ViModel
from vi.inference.task_types import GenericResponse

model = ViModel(run_id="your-run-id")

# Generic inference (fallback case)
result, error = model(source="image.jpg", user_prompt="Analyze this image")

if error is None:
    # Access raw result text
    output = result.result
    print(f"Model output: {output}")
    
    # Check whether this is a fallback (structured parsing failed);
    # a GenericResponse's result is just the raw text, so check the type
    if isinstance(result, GenericResponse):
        print("Note: Structured parsing failed, using raw output")

⚠️ When generic responses occur

You'll receive a GenericResponse instead of a structured response when:

  • The model's JSON output is malformed or incomplete
  • The output doesn't match the expected schema
  • Task type is explicitly set to GENERIC (not common in normal usage)

Troubleshooting:

  • Check result.raw_output to see the full model output
  • Verify the model is properly trained for the task type
  • Try adjusting generation config parameters
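
These troubleshooting steps can be wrapped in a small diagnostic helper. A minimal sketch (debug_generic is a hypothetical name, not an SDK function; the GenericResponse import path matches the one shown below):

import json

from vi.inference.task_types import GenericResponse

def debug_generic(result):
    """Print diagnostics when structured parsing fell back to a generic response."""
    if not isinstance(result, GenericResponse):
        return
    print("=== Raw model output ===")
    print(result.raw_output)
    if result.thinking:
        print("=== Extracted thinking ===")
        print(result.thinking)
    # Best-effort recovery: the raw text may still contain valid JSON
    try:
        print("Recovered JSON:", json.loads(result.result))
    except (json.JSONDecodeError, TypeError):
        print("Raw output is not valid JSON")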

Type checking and handling

Checking response type

Use isinstance() to determine the response type:

from vi.inference.task_types import (
    PredictionResponse,
    GenericResponse
)
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

result, error = model(source="image.jpg")

if error is None:
    if isinstance(result, VQAResponse):
        print(f"VQA Answer: {result.result.answer}")
    
    elif isinstance(result, PhraseGroundingResponse):
        print(f"Caption: {result.result.sentence}")
        print(f"Objects: {len(result.result.groundings)}")
    
    elif isinstance(result, GenericResponse):
        print(f"Raw output: {result.result}")
        print("Warning: Structured parsing may have failed")

Safe field access

Always check for field existence when handling mixed response types:

def extract_text(result):
    """Extract text from any response type."""
    if hasattr(result, 'result'):
        # Check for VQA
        if hasattr(result.result, 'answer'):
            return result.result.answer
        
        # Check for Phrase Grounding
        if hasattr(result.result, 'sentence'):
            return result.result.sentence
        
        # Generic response
        if isinstance(result.result, str):
            return result.result
    
    return None

# Usage
result, error = model(source="image.jpg")
if error is None:
    text = extract_text(result)
    print(f"Extracted text: {text}")

Batch inference schemas

When processing multiple images, results are returned as a list of (result, error) tuples:

Batch result structure

# Type signature for batch inference
def batch_inference(
    source: list[str],
    user_prompt: str
) -> list[tuple[PredictionResponse | None, Exception | None]]:
    """Returns list of (result, error) tuples."""
    ...

Example: Processing batch results

from vi.inference.task_types.vqa import VQAResponse

images = ["img1.jpg", "img2.jpg", "img3.jpg"]
results = model(source=images, user_prompt="What's in this image?")

# Process all results
for i, (result, error) in enumerate(results):
    if error is None:
        # Check response type
        if isinstance(result, VQAResponse):
            print(f"Image {i+1}: {result.result.answer}")
    else:
        print(f"Image {i+1} failed: {error}")

Batch error handling

# Separate successful and failed results
successful = []
failed = []

for img, (result, error) in zip(images, results):
    if error is None:
        successful.append((img, result))
    else:
        failed.append((img, error))

print(f"Successful: {len(successful)}/{len(images)}")
print(f"Failed: {len(failed)}/{len(images)}")

# Process successful results
for img_path, result in successful:
    if isinstance(result, VQAResponse):
        print(f"{img_path}: {result.result.answer}")

Advanced usage

Accessing raw output

All responses include the raw model output before parsing:

result, error = model(source="image.jpg")

if error is None:
    # Inspect raw output (useful for debugging)
    print("=== Raw Model Output ===")
    print(result.raw_output)
    print()
    
    # Check for chain-of-thought reasoning
    if result.thinking:
        print("=== Model's Reasoning ===")
        print(result.thinking)
        print()
    
    # Access parsed result
    if isinstance(result, VQAResponse):
        print("=== Parsed Answer ===")
        print(result.result.answer)

Chain-of-thought (COT) responses

When COT is enabled, responses include the model's reasoning:

from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

result, error = model(
    source="image.jpg",
    user_prompt="Count the number of cars",
    generation_config={
        "enable_cot": True  # Enable chain-of-thought
    }
)

if error is None:
    # Access the reasoning process
    if result.thinking:
        print("Model's reasoning:")
        print(result.thinking)
        print()
    
    # Access the final parsed answer (a VQA task in this example)
    print("Final answer:")
    print(result.result.answer)

Exporting schemas to JSON

Convert response objects to JSON for storage or analysis:

import json

from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

result, error = model(source="image.jpg")

if error is None:
    # For VQA responses
    if isinstance(result, VQAResponse):
        output = {
            "type": "vqa",
            "prompt": result.prompt,
            "answer": result.result.answer,
            "thinking": result.thinking
        }
    
    # For Phrase Grounding responses
    elif isinstance(result, PhraseGroundingResponse):
        output = {
            "type": "phrase_grounding",
            "prompt": result.prompt,
            "sentence": result.result.sentence,
            "objects": [
                {
                    "phrase": g.phrase,
                    "bounding_boxes": g.grounding
                }
                for g in result.result.groundings
            ],
            "thinking": result.thinking
        }
    
    # Fallback: keep the raw text for any other response type
    else:
        output = {
            "type": "generic",
            "prompt": result.prompt,
            "output": str(result.result),
            "thinking": result.thinking
        }
    
    # Save to file
    with open("result.json", "w") as f:
        json.dump(output, f, indent=2)

Common patterns

Universal text extraction

Extract text from any response type:

from vi.inference.task_types import GenericResponse
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

def get_text_output(result):
    """Get text output regardless of response type."""
    if isinstance(result, VQAResponse):
        return result.result.answer
    
    elif isinstance(result, PhraseGroundingResponse):
        return result.result.sentence
    
    elif isinstance(result, GenericResponse):
        return result.result
    
    return None

# Usage
result, error = model(source="image.jpg")
if error is None:
    text = get_text_output(result)
    print(text)

Count objects in phrase grounding

def count_objects(result):
    """Count detected objects in phrase grounding result."""
    if isinstance(result, PhraseGroundingResponse):
        return len(result.result.groundings)
    return 0

# Usage
result, error = model(source="image.jpg")
if error is None:
    num_objects = count_objects(result)
    print(f"Detected {num_objects} objects")

Filter by confidence (when available)

# Some models may include confidence scores in the future
# This is a forward-compatible pattern

def filter_high_confidence(result, threshold=0.5):
    """Filter groundings by confidence threshold."""
    if not isinstance(result, PhraseGroundingResponse):
        return result
    
    # Currently all groundings are included
    # In future versions, you might filter by confidence
    return result

Related resources