Prediction result schemas
Reference documentation for all prediction result schemas returned by different task types in the Vi SDK.
Understanding result schemas

Each task type returns a specific response structure:
- VQA (Visual Question Answering) — Text answers to questions
- Phrase Grounding — Captions with bounding boxes
- Generic — Raw text output for fallback cases
All responses include common fields like `prompt`, `raw_output`, and `thinking`.
Overview
Prediction responses inherit from a base `PredictionResponse` class and add task-specific fields. Understanding these schemas helps you:
- Access the right fields — Know which properties are available for each task type
- Handle different task types — Write robust code that works with all response types
- Debug issues — Inspect `raw_output` and `thinking` fields when needed
- Parse results correctly — Extract structured data from predictions
Base response fields
All prediction responses include these base fields from `PredictionResponse`:
Common fields
| Field | Type | Description |
|---|---|---|
| `prompt` | `str` | The user prompt used for this prediction |
| `raw_output` | `str \| None` | Raw model output string before parsing (includes `<think>` and `<answer>` tags if COT enabled) |
| `thinking` | `str \| None` | Extracted content from `<think>...</think>` tags (available only when chain-of-thought is enabled) |
Example: Accessing base fields
```python
from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

result, error = model(source="image.jpg", user_prompt="Describe this image")

if error is None:
    # Access base fields (available on all response types)
    print(f"Prompt: {result.prompt}")
    print(f"Raw output length: {len(result.raw_output) if result.raw_output else 0}")

    # Check for chain-of-thought reasoning
    if result.thinking:
        print(f"Model's reasoning: {result.thinking}")
```

VQA response
Returned when using Visual Question Answering task type. Contains a text answer to the user's question.
Schema structure
```python
class VQAResponse(PredictionResponse):
    """VQA response object.

    Attributes:
        prompt: The user prompt used for inference
        raw_output: Raw model output string (includes tags if COT enabled)
        thinking: Extracted <think> content (None if not present)
        result: VQAAnswer object containing the answer
    """

    result: VQAAnswer


class VQAAnswer:
    """VQA answer object.

    Attributes:
        answer: The answer text (minimum 1 character)
    """

    answer: str
```

Available fields
| Field | Type | Description |
|---|---|---|
| `result` | `VQAAnswer` | Container object for the answer |
| `result.answer` | `str` | The actual answer text to the question |
Example usage
```python
from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

# VQA inference
result, error = model(
    source="image.jpg",
    user_prompt="What color is the car in this image?"
)

if error is None:
    # Access VQA answer
    answer = result.result.answer
    print(f"Answer: {answer}")

    # Access base fields
    print(f"Question: {result.prompt}")
    if result.thinking:
        print(f"Reasoning: {result.thinking}")
```

Streaming mode
```python
# VQA with streaming
stream = model(
    source="image.jpg",
    user_prompt="Describe this image in detail",
    stream=True
)

# Stream tokens
for token in stream:
    print(token, end="", flush=True)

# Get final result
result = stream.get_final_completion()
print(f"\n\nFinal answer: {result.result.answer}")
```

Phrase grounding response
Returned when using Phrase Grounding task type. Contains a caption with bounding boxes for detected objects.
Schema structure
```python
class PhraseGroundingResponse(PredictionResponse):
    """Phrase grounding response object.

    Attributes:
        prompt: The user prompt used for inference
        raw_output: Raw model output string (includes tags if COT enabled)
        thinking: Extracted <think> content (None if not present)
        result: PhraseGrounding object containing sentence and groundings
    """

    result: PhraseGrounding


class PhraseGrounding:
    """Phrase grounding object.

    Attributes:
        sentence: Full caption/sentence text (minimum 1 character)
        groundings: List of grounded phrases with bounding boxes (minimum 1)
    """

    sentence: str
    groundings: list[GroundedPhrase]


class GroundedPhrase:
    """Text phrase with associated bounding box.

    Attributes:
        phrase: The text phrase (minimum 1 character)
        grounding: List of bounding boxes [xmin, ymin, xmax, ymax] in range [0, 1024]
    """

    phrase: str
    grounding: list[list[int]]  # Each box: [xmin, ymin, xmax, ymax]
```

Available fields
| Field | Type | Description |
|---|---|---|
| `result` | `PhraseGrounding` | Container object for phrase grounding results |
| `result.sentence` | `str` | The full caption/description text |
| `result.groundings` | `list[GroundedPhrase]` | List of detected objects with bounding boxes |
| `result.groundings[i].phrase` | `str` | Text label for the i-th detected object |
| `result.groundings[i].grounding` | `list[list[int]]` | List of bounding boxes for the i-th object |
Bounding box format
Coordinate system

Bounding boxes use normalized coordinates in the range `[0, 1024]`:

- Format: `[x_min, y_min, x_max, y_max]`
- Top-left corner: `(0, 0)`
- Bottom-right corner: `(1024, 1024)`
- Independent of actual image dimensions

Convert to pixel coordinates for visualization:

```python
from PIL import Image

image = Image.open("image.jpg")
width, height = image.size

x_min, y_min, x_max, y_max = bbox  # [0-1024] range
pixel_x_min = int(x_min / 1024 * width)
pixel_y_min = int(y_min / 1024 * height)
pixel_x_max = int(x_max / 1024 * width)
pixel_y_max = int(y_max / 1024 * height)
```

Tip: Use the built-in `visualize_prediction()` utility for automatic coordinate conversion and visualization.
Example usage
```python
from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

# Phrase grounding inference
result, error = model(
    source="image.jpg",
    user_prompt="Describe the objects in this image"
)

if error is None:
    # Access the caption
    caption = result.result.sentence
    print(f"Caption: {caption}")

    # Access grounded phrases (detected objects)
    for grounded_phrase in result.result.groundings:
        phrase = grounded_phrase.phrase
        bboxes = grounded_phrase.grounding
        print(f"\nObject: {phrase}")
        print(f"  Number of bounding boxes: {len(bboxes)}")
        for i, bbox in enumerate(bboxes):
            x_min, y_min, x_max, y_max = bbox
            print(f"  Box {i+1}: [{x_min}, {y_min}, {x_max}, {y_max}]")
```

Converting to pixel coordinates
```python
from PIL import Image

def convert_bbox_to_pixels(bbox, image_path):
    """Convert normalized bbox [0-1024] to pixel coordinates."""
    image = Image.open(image_path)
    width, height = image.size
    x_min, y_min, x_max, y_max = bbox
    return [
        int(x_min / 1024 * width),
        int(y_min / 1024 * height),
        int(x_max / 1024 * width),
        int(y_max / 1024 * height)
    ]

# Usage with phrase grounding results
result, error = model(source="image.jpg")

if error is None:
    for grounded_phrase in result.result.groundings:
        print(f"Object: {grounded_phrase.phrase}")
        for bbox in grounded_phrase.grounding:
            pixel_bbox = convert_bbox_to_pixels(bbox, "image.jpg")
            print(f"  Pixel coordinates: {pixel_bbox}")
```
Filtering groundings

```python
# Filter groundings by object type
result, error = model(source="image.jpg")

if error is None:
    # Find all people
    people = [
        g for g in result.result.groundings
        if "person" in g.phrase.lower()
    ]

    # Find all vehicles
    vehicles = [
        g for g in result.result.groundings
        if any(v in g.phrase.lower() for v in ["car", "truck", "vehicle"])
    ]

    print(f"Found {len(people)} people and {len(vehicles)} vehicles")
```

Generic response
Returned as a fallback when:
- Task type is explicitly set to `GENERIC`
- JSON parsing fails for structured task types
- Model output cannot be parsed into the expected format

Contains the raw text output without structured parsing.
Schema structure
```python
class GenericResponse(PredictionResponse):
    """Generic response object.

    Used for the generic task type or when parsing fails. The result
    contains the full raw output (includes thinking and answer content
    if COT enabled).

    Attributes:
        prompt: The user prompt used for inference
        raw_output: Raw model output string (includes tags if COT enabled)
        thinking: Extracted <think> content (None if not present)
        result: The raw output string
    """

    result: str
```

Available fields
| Field | Type | Description |
|---|---|---|
| `result` | `str` | The complete raw output text from the model |
Example usage
```python
from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

# Generic inference (fallback case)
result, error = model(source="image.jpg", user_prompt="Analyze this image")

if error is None:
    # Access raw result text
    output = result.result
    print(f"Model output: {output}")

    # Check if this is a fallback (parsing failed)
    if result.raw_output != result.result:
        print("Note: Structured parsing failed, using raw output")
```
When generic responses occur

You'll receive a `GenericResponse` instead of a structured response when:

- The model's JSON output is malformed or incomplete
- The output doesn't match the expected schema
- Task type is explicitly set to `GENERIC` (not common in normal usage)

Troubleshooting:

- Check `result.raw_output` to see the full model output (a best-effort recovery sketch follows this list)
- Verify the model is properly trained for the task type
- Try adjusting generation config parameters
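If you expected structured output but received a `GenericResponse`, you can sometimes recover the data yourself. Below is a minimal sketch that attempts to parse the raw text as JSON; it assumes the model emitted JSON-like output that merely failed the SDK's stricter parsing, which won't always be the case:

```python
import json

from vi.inference.task_types import GenericResponse

result, error = model(source="image.jpg", user_prompt="Describe the objects")

if error is None and isinstance(result, GenericResponse):
    try:
        # Best-effort parse of the raw output text
        recovered = json.loads(result.result)
        print(f"Recovered structured data: {recovered}")
    except json.JSONDecodeError:
        # The output really is plain text; use it as-is
        print(f"Plain text output: {result.result}")
```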
Type checking and handling
Checking response type
Use `isinstance()` to determine the response type:
```python
from vi.inference.task_types import (
    PredictionResponse,
    GenericResponse
)
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

result, error = model(source="image.jpg")

if error is None:
    if isinstance(result, VQAResponse):
        print(f"VQA Answer: {result.result.answer}")
    elif isinstance(result, PhraseGroundingResponse):
        print(f"Caption: {result.result.sentence}")
        print(f"Objects: {len(result.result.groundings)}")
    elif isinstance(result, GenericResponse):
        print(f"Raw output: {result.result}")
        print("Warning: Structured parsing may have failed")
```

Safe field access
Always check for field existence when handling mixed response types:
```python
def extract_text(result):
    """Extract text from any response type."""
    if hasattr(result, 'result'):
        # Check for VQA
        if hasattr(result.result, 'answer'):
            return result.result.answer
        # Check for Phrase Grounding
        if hasattr(result.result, 'sentence'):
            return result.result.sentence
        # Generic response
        if isinstance(result.result, str):
            return result.result
    return None

# Usage
result, error = model(source="image.jpg")

if error is None:
    text = extract_text(result)
    print(f"Extracted text: {text}")
```

Batch inference schemas
When processing multiple images, results are returned as a list of `(result, error)` tuples:
Batch result structure
```python
# Type signature for batch inference
def batch_inference(
    source: list[str],
    user_prompt: str
) -> list[tuple[PredictionResponse | None, Exception | None]]:
    """Returns a list of (result, error) tuples."""
    ...
```

Example: Processing batch results
```python
from vi.inference.task_types.vqa import VQAResponse

images = ["img1.jpg", "img2.jpg", "img3.jpg"]
results = model(source=images, user_prompt="What's in this image?")

# Process all results
for i, (result, error) in enumerate(results):
    if error is None:
        # Check response type
        if isinstance(result, VQAResponse):
            print(f"Image {i+1}: {result.result.answer}")
    else:
        print(f"Image {i+1} failed: {error}")
```

Batch error handling
```python
# Separate successful and failed results
successful = []
failed = []

for img, (result, error) in zip(images, results):
    if error is None:
        successful.append((img, result))
    else:
        failed.append((img, error))

print(f"Successful: {len(successful)}/{len(images)}")
print(f"Failed: {len(failed)}/{len(images)}")

# Process successful results
for img_path, result in successful:
    if isinstance(result, VQAResponse):
        print(f"{img_path}: {result.result.answer}")
```
Advanced usage

Accessing raw output
All responses include the raw model output before parsing:
```python
from vi.inference.task_types.vqa import VQAResponse

result, error = model(source="image.jpg")

if error is None:
    # Inspect raw output (useful for debugging)
    print("=== Raw Model Output ===")
    print(result.raw_output)
    print()

    # Check for chain-of-thought reasoning
    if result.thinking:
        print("=== Model's Reasoning ===")
        print(result.thinking)
        print()

    # Access parsed result
    if isinstance(result, VQAResponse):
        print("=== Parsed Answer ===")
        print(result.result.answer)
```

Chain-of-thought (COT) responses
When COT is enabled, responses include the model's reasoning:
```python
from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

result, error = model(
    source="image.jpg",
    user_prompt="Count the number of cars",
    generation_config={
        "enable_cot": True  # Enable chain-of-thought
    }
)

if error is None:
    # Access the reasoning process
    if result.thinking:
        print("Model's reasoning:")
        print(result.thinking)
        print()

    # Access the final answer
    print("Final answer:")
    print(result.result.answer)
```

Exporting schemas to JSON
Convert response objects to JSON for storage or analysis:
```python
import json

from vi.inference.task_types import GenericResponse
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

result, error = model(source="image.jpg")

if error is None:
    # For VQA responses
    if isinstance(result, VQAResponse):
        output = {
            "type": "vqa",
            "prompt": result.prompt,
            "answer": result.result.answer,
            "thinking": result.thinking
        }
    # For Phrase Grounding responses
    elif isinstance(result, PhraseGroundingResponse):
        output = {
            "type": "phrase_grounding",
            "prompt": result.prompt,
            "sentence": result.result.sentence,
            "objects": [
                {
                    "phrase": g.phrase,
                    "bounding_boxes": g.grounding
                }
                for g in result.result.groundings
            ],
            "thinking": result.thinking
        }
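    # For Generic responses (illustrative addition: result is already a plain string)
    elif isinstance(result, GenericResponse):
        output = {
            "type": "generic",
            "prompt": result.prompt,
            "text": result.result,
            "thinking": result.thinking
        }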
    # Save to file
    with open("result.json", "w") as f:
        json.dump(output, f, indent=2)
```

Common patterns
Universal text extraction
Extract text from any response type:
```python
from vi.inference.task_types import GenericResponse
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

def get_text_output(result):
    """Get text output regardless of response type."""
    if isinstance(result, VQAResponse):
        return result.result.answer
    elif isinstance(result, PhraseGroundingResponse):
        return result.result.sentence
    elif isinstance(result, GenericResponse):
        return result.result
    return None

# Usage
result, error = model(source="image.jpg")

if error is None:
    text = get_text_output(result)
    print(text)
```

Count objects in phrase grounding
```python
def count_objects(result):
    """Count detected objects in phrase grounding result."""
    if isinstance(result, PhraseGroundingResponse):
        return len(result.result.groundings)
    return 0

# Usage
result, error = model(source="image.jpg")

if error is None:
    num_objects = count_objects(result)
    print(f"Detected {num_objects} objects")
```

Filter by confidence (when available)
```python
# Some models may include confidence scores in the future.
# This is a forward-compatible pattern.
def filter_high_confidence(result, threshold=0.5):
    """Filter groundings by confidence threshold."""
    if not isinstance(result, PhraseGroundingResponse):
        return result
    # Currently all groundings are included.
    # In future versions, you might filter by confidence.
    return result
```

Related resources
- Run inference — Execute predictions on single images and batches
- Handle results — Process captions, bounding boxes, and visualize predictions
- Task types — VQA and phrase grounding explained
- Configure generation — Control temperature, max tokens, and sampling parameters
- Troubleshoot issues — Common problems and solutions
- Vi SDK getting started — Quick start guide for the SDK