Prediction result schemas
Reference documentation for all prediction result schemas returned by different task types in the Vi SDK.
Understanding result schemas

Each task type returns a specific response structure:
- VQA (Visual Question Answering) — Text answers to questions
- Phrase Grounding — Captions with bounding boxes
- Generic — Raw text output for fallback cases
All responses include common fields like `prompt`, `raw_output`, and `thinking`.
Overview
Prediction responses inherit from a base `PredictionResponse` class and add task-specific fields. Understanding these schemas helps you:
- Access the right fields — Know which properties are available for each task type
- Handle different task types — Write robust code that works with all response types
- Debug issues — Inspect `raw_output` and `thinking` fields when needed
- Parse results correctly — Extract structured data from predictions
Base response fields
All prediction responses include these base fields from `PredictionResponse`:
Common fields
| Field | Type | Description |
|---|---|---|
| `prompt` | `str` | The user prompt used for this prediction |
| `raw_output` | `str \| None` | Raw model output string before parsing (includes `<think>` and `<answer>` tags if COT enabled) |
| `thinking` | `str \| None` | Extracted content from `<think>...</think>` tags (available only when chain-of-thought is enabled) |
Example: Accessing base fields
```python
from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

result, error = model(source="image.jpg", user_prompt="Describe this image")

if error is None:
    # Access base fields (available on all response types)
    print(f"Prompt: {result.prompt}")
    print(f"Raw output length: {len(result.raw_output) if result.raw_output else 0}")

    # Check for chain-of-thought reasoning
    if result.thinking:
        print(f"Model's reasoning: {result.thinking}")
```

VQA response
Returned when using Visual Question Answering task type. Contains a text answer to the user's question.
Schema structure
```python
class VQAResponse(PredictionResponse):
    """VQA response object.

    Attributes:
        prompt: The user prompt used for inference
        raw_output: Raw model output string (includes tags if COT enabled)
        thinking: Extracted <think> content (None if not present)
        result: VQAAnswer object containing the answer
    """

    result: VQAAnswer


class VQAAnswer:
    """VQA answer object.

    Attributes:
        answer: The answer text (minimum 1 character)
    """

    answer: str
```

Available fields
| Field | Type | Description |
|---|---|---|
| `result` | `VQAAnswer` | Container object for the answer |
| `result.answer` | `str` | The actual answer text to the question |
Example usage
```python
from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

# VQA inference
result, error = model(
    source="image.jpg",
    user_prompt="What color is the car in this image?"
)

if error is None:
    # Access VQA answer
    answer = result.result.answer
    print(f"Answer: {answer}")

    # Access base fields
    print(f"Question: {result.prompt}")
    if result.thinking:
        print(f"Reasoning: {result.thinking}")
```

Streaming mode
```python
# VQA with streaming
stream = model(
    source="image.jpg",
    user_prompt="Describe this image in detail",
    stream=True
)

# Stream tokens
for token in stream:
    print(token, end="", flush=True)

# Get final result
result = stream.get_final_completion()
print(f"\n\nFinal answer: {result.result.answer}")
```

Phrase grounding response
Returned when using Phrase Grounding task type. Contains a caption with bounding boxes for detected objects.
Schema structure
```python
class PhraseGroundingResponse(PredictionResponse):
    """Phrase grounding response object.

    Attributes:
        prompt: The user prompt used for inference
        raw_output: Raw model output string (includes tags if COT enabled)
        thinking: Extracted <think> content (None if not present)
        result: PhraseGrounding object containing sentence and groundings
    """

    result: PhraseGrounding


class PhraseGrounding:
    """Phrase grounding object.

    Attributes:
        sentence: Full caption/sentence text (minimum 1 character)
        groundings: List of grounded phrases with bounding boxes (minimum 1)
    """

    sentence: str
    groundings: list[GroundedPhrase]


class GroundedPhrase:
    """Text phrase with associated bounding box.

    Attributes:
        phrase: The text phrase (minimum 1 character)
        grounding: List of bounding boxes [xmin, ymin, xmax, ymax] in range [0, 1024]
    """

    phrase: str
    grounding: list[list[int]]  # Each box: [xmin, ymin, xmax, ymax]
```

Available fields
| Field | Type | Description |
|---|---|---|
| `result` | `PhraseGrounding` | Container object for phrase grounding results |
| `result.sentence` | `str` | The full caption/description text |
| `result.groundings` | `list[GroundedPhrase]` | List of detected objects with bounding boxes |
| `result.groundings[i].phrase` | `str` | Text label for the i-th detected object |
| `result.groundings[i].grounding` | `list[list[int]]` | List of bounding boxes for the i-th object |
Bounding box format
Coordinate system

Bounding boxes use normalized coordinates in the range `[0, 1024]`:

- Format: `[x_min, y_min, x_max, y_max]`
- Top-left corner: `(0, 0)`
- Bottom-right corner: `(1024, 1024)`
- Independent of actual image dimensions

Convert to pixel coordinates for visualization:

```python
from PIL import Image

image = Image.open("image.jpg")
width, height = image.size

x_min, y_min, x_max, y_max = bbox  # [0-1024] range
pixel_x_min = int(x_min / 1024 * width)
pixel_y_min = int(y_min / 1024 * height)
pixel_x_max = int(x_max / 1024 * width)
pixel_y_max = int(y_max / 1024 * height)
```

Tip: Use the built-in `visualize_prediction()` utility for automatic coordinate conversion and visualization.
Example usage
```python
from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

# Phrase grounding inference
result, error = model(
    source="image.jpg",
    user_prompt="Describe the objects in this image"
)

if error is None:
    # Access the caption
    caption = result.result.sentence
    print(f"Caption: {caption}")

    # Access grounded phrases (detected objects)
    for grounded_phrase in result.result.groundings:
        phrase = grounded_phrase.phrase
        bboxes = grounded_phrase.grounding
        print(f"\nObject: {phrase}")
        print(f"  Number of bounding boxes: {len(bboxes)}")
        for i, bbox in enumerate(bboxes):
            x_min, y_min, x_max, y_max = bbox
            print(f"  Box {i+1}: [{x_min}, {y_min}, {x_max}, {y_max}]")
```

Converting to pixel coordinates
```python
from PIL import Image

def convert_bbox_to_pixels(bbox, image_path):
    """Convert normalized bbox [0-1024] to pixel coordinates."""
    image = Image.open(image_path)
    width, height = image.size
    x_min, y_min, x_max, y_max = bbox
    return [
        int(x_min / 1024 * width),
        int(y_min / 1024 * height),
        int(x_max / 1024 * width),
        int(y_max / 1024 * height)
    ]

# Usage with phrase grounding results
result, error = model(source="image.jpg")

if error is None:
    for grounded_phrase in result.result.groundings:
        print(f"Object: {grounded_phrase.phrase}")
        for bbox in grounded_phrase.grounding:
            pixel_bbox = convert_bbox_to_pixels(bbox, "image.jpg")
            print(f"  Pixel coordinates: {pixel_bbox}")
```
Filtering groundings

```python
# Filter groundings by object type
result, error = model(source="image.jpg")

if error is None:
    # Find all people
    people = [
        g for g in result.result.groundings
        if "person" in g.phrase.lower()
    ]

    # Find all vehicles
    vehicles = [
        g for g in result.result.groundings
        if any(v in g.phrase.lower() for v in ["car", "truck", "vehicle"])
    ]

    print(f"Found {len(people)} people and {len(vehicles)} vehicles")
```

Generic response
Returned as a fallback when:
- Task type is explicitly set to `GENERIC`
- JSON parsing fails for structured task types
- Model output cannot be parsed into the expected format

Contains the raw text output without structured parsing.
Schema structure
```python
class GenericResponse(PredictionResponse):
    """Generic response object.

    Used for the generic task type or when parsing fails. The result
    contains the full raw output (includes thinking and answer content
    if COT enabled).

    Attributes:
        prompt: The user prompt used for inference
        raw_output: Raw model output string (includes tags if COT enabled)
        thinking: Extracted <think> content (None if not present)
        result: The raw output string
    """

    result: str
```

Available fields
| Field | Type | Description |
|---|---|---|
| `result` | `str` | The complete raw output text from the model |
Example usage
```python
from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

# Generic inference (fallback case)
result, error = model(source="image.jpg", user_prompt="Analyze this image")

if error is None:
    # Access raw result text
    output = result.result
    print(f"Model output: {output}")

    # Check if this is a fallback (parsing failed)
    if result.raw_output != result.result:
        print("Note: Structured parsing failed, using raw output")
```
When generic responses occur

You'll receive a `GenericResponse` instead of a structured response when:

- The model's JSON output is malformed or incomplete
- The output doesn't match the expected schema
- Task type is explicitly set to `GENERIC` (not common in normal usage)

Troubleshooting:

- Check `result.raw_output` to see the full model output (a best-effort recovery sketch follows this list)
- Verify the model is properly trained for the task type
- Try adjusting generation config parameters
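If you expected structured output but received a `GenericResponse`, you can sometimes recover the data yourself. Below is a minimal sketch that attempts to parse the raw text as JSON; it assumes the model emitted JSON-like output that merely failed the SDK's stricter parsing, which won't always be the case:

```python
import json

from vi.inference.task_types import GenericResponse

result, error = model(source="image.jpg", user_prompt="Describe the objects")

if error is None and isinstance(result, GenericResponse):
    try:
        # Best-effort parse of the raw output text
        recovered = json.loads(result.result)
        print(f"Recovered structured data: {recovered}")
    except json.JSONDecodeError:
        # The output really is plain text; use it as-is
        print(f"Plain text output: {result.result}")
```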
Type checking and handling
Checking response type
Use `isinstance()` to determine the response type:
```python
from vi.inference.task_types import (
    PredictionResponse,
    GenericResponse
)
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

result, error = model(source="image.jpg")

if error is None:
    if isinstance(result, VQAResponse):
        print(f"VQA Answer: {result.result.answer}")
    elif isinstance(result, PhraseGroundingResponse):
        print(f"Caption: {result.result.sentence}")
        print(f"Objects: {len(result.result.groundings)}")
    elif isinstance(result, GenericResponse):
        print(f"Raw output: {result.result}")
        print("Warning: Structured parsing may have failed")
```

Safe field access
Always check for field existence when handling mixed response types:
```python
def extract_text(result):
    """Extract text from any response type."""
    if hasattr(result, 'result'):
        # Check for VQA
        if hasattr(result.result, 'answer'):
            return result.result.answer
        # Check for Phrase Grounding
        if hasattr(result.result, 'sentence'):
            return result.result.sentence
        # Generic response
        if isinstance(result.result, str):
            return result.result
    return None

# Usage
result, error = model(source="image.jpg")

if error is None:
    text = extract_text(result)
    print(f"Extracted text: {text}")
```

Batch inference schemas
When processing multiple images, results are returned as a list of `(result, error)` tuples:
Batch result structure
```python
# Type signature for batch inference
def batch_inference(
    source: list[str],
    user_prompt: str
) -> list[tuple[PredictionResponse | None, Exception | None]]:
    """Returns a list of (result, error) tuples."""
    ...
```

Example: Processing batch results
```python
from vi.inference.task_types.vqa import VQAResponse

images = ["img1.jpg", "img2.jpg", "img3.jpg"]
results = model(source=images, user_prompt="What's in this image?")

# Process all results
for i, (result, error) in enumerate(results):
    if error is None:
        # Check response type
        if isinstance(result, VQAResponse):
            print(f"Image {i+1}: {result.result.answer}")
    else:
        print(f"Image {i+1} failed: {error}")
```

Batch error handling
```python
# Separate successful and failed results
successful = []
failed = []

for img, (result, error) in zip(images, results):
    if error is None:
        successful.append((img, result))
    else:
        failed.append((img, error))

print(f"Successful: {len(successful)}/{len(images)}")
print(f"Failed: {len(failed)}/{len(images)}")

# Process successful results
for img_path, result in successful:
    if isinstance(result, VQAResponse):
        print(f"{img_path}: {result.result.answer}")
```
Advanced usage

Accessing raw output
All responses include the raw model output before parsing:
```python
from vi.inference.task_types.vqa import VQAResponse

result, error = model(source="image.jpg")

if error is None:
    # Inspect raw output (useful for debugging)
    print("=== Raw Model Output ===")
    print(result.raw_output)
    print()

    # Check for chain-of-thought reasoning
    if result.thinking:
        print("=== Model's Reasoning ===")
        print(result.thinking)
        print()

    # Access parsed result
    if isinstance(result, VQAResponse):
        print("=== Parsed Answer ===")
        print(result.result.answer)
```

Chain-of-thought (COT) responses
When COT is enabled, responses include the model's reasoning:
```python
from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

result, error = model(
    source="image.jpg",
    user_prompt="Count the number of cars",
    generation_config={
        "enable_cot": True  # Enable chain-of-thought
    }
)

if error is None:
    # Access the reasoning process
    if result.thinking:
        print("Model's reasoning:")
        print(result.thinking)
        print()

    # Access the final answer
    print("Final answer:")
    print(result.result.answer)
```

Exporting schemas to JSON
Convert response objects to JSON for storage or analysis:
```python
import json

from vi.inference.task_types import GenericResponse
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

result, error = model(source="image.jpg")

if error is None:
    # For VQA responses
    if isinstance(result, VQAResponse):
        output = {
            "type": "vqa",
            "prompt": result.prompt,
            "answer": result.result.answer,
            "thinking": result.thinking
        }
    # For Phrase Grounding responses
    elif isinstance(result, PhraseGroundingResponse):
        output = {
            "type": "phrase_grounding",
            "prompt": result.prompt,
            "sentence": result.result.sentence,
            "objects": [
                {
                    "phrase": g.phrase,
                    "bounding_boxes": g.grounding
                }
                for g in result.result.groundings
            ],
            "thinking": result.thinking
        }
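    # For Generic responses (illustrative addition: result is already a plain string)
    elif isinstance(result, GenericResponse):
        output = {
            "type": "generic",
            "prompt": result.prompt,
            "text": result.result,
            "thinking": result.thinking
        }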
    # Save to file
    with open("result.json", "w") as f:
        json.dump(output, f, indent=2)
```

Common patterns
Universal text extraction
Extract text from any response type:
```python
from vi.inference.task_types import GenericResponse
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

def get_text_output(result):
    """Get text output regardless of response type."""
    if isinstance(result, VQAResponse):
        return result.result.answer
    elif isinstance(result, PhraseGroundingResponse):
        return result.result.sentence
    elif isinstance(result, GenericResponse):
        return result.result
    return None

# Usage
result, error = model(source="image.jpg")

if error is None:
    text = get_text_output(result)
    print(text)
```

Count objects in phrase grounding
```python
def count_objects(result):
    """Count detected objects in phrase grounding result."""
    if isinstance(result, PhraseGroundingResponse):
        return len(result.result.groundings)
    return 0

# Usage
result, error = model(source="image.jpg")

if error is None:
    num_objects = count_objects(result)
    print(f"Detected {num_objects} objects")
```

Filter by confidence (when available)
```python
# Some models may include confidence scores in the future.
# This is a forward-compatible pattern.
def filter_high_confidence(result, threshold=0.5):
    """Filter groundings by confidence threshold."""
    if not isinstance(result, PhraseGroundingResponse):
        return result
    # Currently all groundings are included.
    # In future versions, you might filter by confidence.
    return result
```

Related resources
- Run inference — Execute predictions on single images and batches
- Handle results — Process captions, bounding boxes, and visualize predictions
- Task types — VQA and phrase grounding explained
- Configure generation — Control temperature, max tokens, and sampling parameters
- Troubleshoot issues — Common problems and solutions
- Vi SDK getting started — Quick start guide for the SDK