Handle results
Access, process, and visualize inference results, including captions and bounding boxes, from VLM predictions.
Prerequisites
- A loaded model with completed inference
- Understanding of task types — VQA or phrase grounding
- Familiarity with streaming vs non-streaming modes
Overview
Inference results from the Vi SDK contain:
- Captions — Text descriptions or answers from VQA tasks
- Grounded phrases — Detected objects with bounding boxes from phrase grounding
- Structured data — Organized, predictable format for easy processing
View complete prediction schemas →
Accessing results
Basic result access
Results are returned as (result, error) tuples by default (non-streaming mode):
from vi.inference import ViModel
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse
model = ViModel(run_id="your-run-id")
result, error = model(
source="image.jpg",
user_prompt="Describe this image"
)
if error is None:
# Access text based on response type
if isinstance(result, VQAResponse):
print(f"Answer: {result.result.answer}")
elif isinstance(result, PhraseGroundingResponse):
print(f"Caption: {result.result.sentence}")
else:
print(f"Error: {error}")See complete schema reference →
Checking available fields
Always check response type to access the correct fields:
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse
result, error = model(source="image.jpg")
if error is None:
# Check for VQA response
if isinstance(result, VQAResponse):
print(f"Answer: {result.result.answer}")
# Check for Phrase Grounding response
elif isinstance(result, PhraseGroundingResponse):
print(f"Caption: {result.result.sentence}")
print(f"Found {len(result.result.groundings)} objects")
for grounding in result.result.groundings:
print(f" - {grounding.phrase}: {grounding.grounding}")Learn about all response types →
Working with captions
Basic caption access
Access text responses from visual question answering and phrase grounding tasks:
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse
result, error = model(
source="image.jpg",
user_prompt="What's in this image?"
)
if error is None:
# VQA responses have an answer field
if isinstance(result, VQAResponse):
text = result.result.answer
print(f"Answer: {text}")
# Phrase Grounding responses have a sentence field
elif isinstance(result, PhraseGroundingResponse):
text = result.result.sentence
print(f"Caption: {text}")Caption processing
Process and analyze VQA and phrase grounding responses:
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse
# Get text from response
if isinstance(result, VQAResponse):
text = result.result.answer
elif isinstance(result, PhraseGroundingResponse):
text = result.result.sentence
else:
text = result.result # Generic response
# Convert to lowercase for comparison
text_lower = text.lower()
# Check for keywords
if "car" in text_lower:
print("Car detected in image")
# Extract information
words = text.split()
word_count = len(words)
print(f"Text length: {word_count} words")Save captions to file
Export inference results for later analysis:
import json
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse
# Helper function to extract text
def get_text(result):
if isinstance(result, VQAResponse):
return result.result.answer
elif isinstance(result, PhraseGroundingResponse):
return result.result.sentence
else:
return result.result
# Save single result
result, error = model(source="image.jpg")
if error is None:
with open("result.json", "w") as f:
json.dump({"text": get_text(result)}, f, indent=2)
# Save multiple results
results = model(source="./images/")
texts = []
for result, error in results:
if error is None:
texts.append(get_text(result))
with open("outputs.txt", "w") as f:
f.write("\n".join(texts))Working with grounded phrases
Accessing grounded phrases
Extract object detections from phrase grounding results:
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse
result, error = model(source="image.jpg")
if error is None and isinstance(result, PhraseGroundingResponse):
for grounding in result.result.groundings:
print(f"Phrase: {grounding.phrase}")
print(f"Bounding boxes: {grounding.grounding}")View complete phrase grounding schema →
Bounding box format
Coordinate system
Bounding boxes from phrase grounding use normalized coordinates in the range [0, 1024]:
- Format: [x_min, y_min, x_max, y_max]
- Top-left corner: (0, 0)
- Bottom-right corner: (1024, 1024)
- Independent of actual image dimensions
Convert to pixel coordinates for visualization and processing.
# Example bbox: [100, 200, 500, 600]
# x_min=100, y_min=200, x_max=500, y_max=600
Converting to pixel coordinates
Convert normalized bounding boxes to actual pixel coordinates for visualization:
from PIL import Image
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse
def bbox_to_pixels(bbox, image_path):
"""Convert normalized bbox [0-1024] to pixel coordinates."""
image = Image.open(image_path)
width, height = image.size
x_min, y_min, x_max, y_max = bbox
return [
int(x_min / 1024 * width),
int(y_min / 1024 * height),
int(x_max / 1024 * width),
int(y_max / 1024 * height)
]
# Usage
result, error = model(source="image.jpg")
if error is None and isinstance(result, PhraseGroundingResponse):
for grounding in result.result.groundings:
# Each grounding can have multiple bounding boxes
for bbox in grounding.grounding:
pixel_bbox = bbox_to_pixels(bbox, "image.jpg")
print(f"{grounding.phrase}: {pixel_bbox}")Filter by object type
Filter grounded phrases by category or attribute:
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse
result, error = model(source="image.jpg")
if error is None and isinstance(result, PhraseGroundingResponse):
# Filter for specific objects
people = [g for g in result.result.groundings if "person" in g.phrase.lower()]
vehicles = [g for g in result.result.groundings if any(v in g.phrase.lower() for v in ["car", "truck", "vehicle"])]
print(f"Found {len(people)} people and {len(vehicles)} vehicles")Visualization
Built-in visualization
The Vi SDK provides a visualize_prediction() utility function that automatically renders predictions with bounding boxes, labels, and captions:
from vi.inference import ViModel
from vi.inference.utils.visualize import visualize_prediction
from pathlib import Path
# Run inference
model = ViModel(run_id="your-run-id")
result, error = model(source="image.jpg")
if error is None:
# Visualize the prediction
image = visualize_prediction(
image_path=Path("image.jpg"),
prediction=result
)
# Display the result
image.show()
# Save the visualization
image.save("prediction_visualization.png")
Automatic visualization features
The built-in visualize_prediction() function automatically handles:
- Bounding boxes with labeled phrases for phrase grounding
- Question and answer panels for VQA tasks
- Coordinate conversion from [0, 1024] to pixel space
- Text wrapping for long captions and labels
- Optimal font sizing based on image dimensions
visualize_prediction() works only with PhraseGroundingResponse and VQAResponse prediction types; it does not work with GenericResponse.
Supported prediction types:
- PhraseGroundingResponse — Draws bounding boxes with phrase labels and displays the sentence caption
- VQAResponse — Creates a side panel showing the question and answer
GenericResponse not supported
The visualize_prediction() function will not work with GenericResponse predictions. GenericResponse indicates that the model output could not be parsed into a structured format (PhraseGrounding or VQA), so there is no standardized way to visualize it.
If you need to visualize GenericResponse outputs, implement your own visualization logic based on your specific output format, as in the sketch below.
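As a starting point, here is a minimal sketch of such custom handling. It assumes only that the raw model output is available as result.result (as in the generic branches of the examples above) and overlays that text onto the image with Pillow; adapt it to whatever structure your outputs actually have:
import textwrap
from PIL import Image, ImageDraw

def visualize_generic(image_path, result, output_path="generic_output.jpg"):
    """Overlay the raw text of an unparsed (GenericResponse) result onto the image."""
    image = Image.open(image_path)
    draw = ImageDraw.Draw(image)
    # Assumption: result.result holds the raw output string, as in the examples above
    wrapped = textwrap.fill(str(result.result), width=60)
    draw.multiline_text((10, 10), wrapped, fill="white")
    image.save(output_path)
    print(f"Saved visualization to {output_path}")

# Usage
result, error = model(source="image.jpg")
if error is None:
    visualize_generic("image.jpg", result, "generic_output.jpg")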
Custom visualization
If you want to customize visualization styling, layout, colors, or behavior beyond what the built-in utility provides, you can implement your own visualization using PIL/Pillow, OpenCV, or matplotlib (a matplotlib sketch follows the two Pillow examples below):
from PIL import Image, ImageDraw, ImageFont
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse
def visualize_result(image_path, result, output_path="output.jpg"):
"""Visualize inference result with bounding boxes."""
image = Image.open(image_path)
draw = ImageDraw.Draw(image)
width, height = image.size
# Draw grounded phrases
if isinstance(result, PhraseGroundingResponse):
for grounding in result.result.groundings:
# Each grounding can have multiple bounding boxes
for bbox in grounding.grounding:
# Convert to pixel coordinates
x_min = bbox[0] / 1024 * width
y_min = bbox[1] / 1024 * height
x_max = bbox[2] / 1024 * width
y_max = bbox[3] / 1024 * height
# Draw rectangle
draw.rectangle(
[(x_min, y_min), (x_max, y_max)],
outline='red',
width=3
)
# Draw label
draw.text(
(x_min, y_min - 10),
grounding.phrase,
fill='red'
)
# Add caption at top
draw.text((10, 10), result.result.sentence[:100], fill='white')
# Save
image.save(output_path)
print(f"Saved visualization to {output_path}")
# Usage
result, error = model(source="image.jpg")
if error is None:
visualize_result("image.jpg", result, "output.jpg")Color-coded visualization (custom)
Example of custom visualization with different colors for each object:
from PIL import Image, ImageDraw
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse
def visualize_with_colors(image_path, result, output_path="output.jpg"):
"""Visualize with different colors for each object."""
image = Image.open(image_path)
draw = ImageDraw.Draw(image)
width, height = image.size
if isinstance(result, PhraseGroundingResponse):
colors = ['red', 'blue', 'green', 'yellow', 'purple', 'orange']
for i, grounding in enumerate(result.result.groundings):
color = colors[i % len(colors)]
# Draw each bounding box for this phrase
for bbox in grounding.grounding:
# Convert bbox
x_min = bbox[0] / 1024 * width
y_min = bbox[1] / 1024 * height
x_max = bbox[2] / 1024 * width
y_max = bbox[3] / 1024 * height
# Draw with color
draw.rectangle(
[(x_min, y_min), (x_max, y_max)],
outline=color,
width=3
)
draw.text(
(x_min, y_min - 10),
grounding.phrase,
fill=color
)
image.save(output_path)
# Usage
result, error = model(source="image.jpg")
if error is None:
visualize_with_colors("image.jpg", result, "colored_output.jpg")
Visualization tips
- Use visualize_prediction() for quick, automatic visualization (works with PhraseGroundingResponse and VQAResponse only)
- Implement custom visualization with PIL/Pillow, OpenCV, or matplotlib when you need:
- Custom colors, fonts, or styling
- Different layouts or arrangements
- GenericResponse outputs
- Additional overlays or metadata
- Add confidence scores or other metadata to labels in custom implementations
Exporting results
Export to JSON
Export inference results to JSON format:
import json
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse
def export_to_json(image_path, result, output_path="result.json"):
"""Export result to JSON format."""
data = {
"image": image_path,
"text": None,
"objects": []
}
# Extract text based on response type
if isinstance(result, VQAResponse):
data["text"] = result.result.answer
data["type"] = "vqa"
elif isinstance(result, PhraseGroundingResponse):
data["text"] = result.result.sentence
data["type"] = "phrase_grounding"
# Add grounded phrases
for grounding in result.result.groundings:
data["objects"].append({
"phrase": grounding.phrase,
"bounding_boxes": grounding.grounding
})
else:
data["text"] = result.result
data["type"] = "generic"
with open(output_path, "w") as f:
json.dump(data, f, indent=2)
print(f"Exported to {output_path}")
# Usage
result, error = model(source="image.jpg")
if error is None:
export_to_json("image.jpg", result, "result.json")Export to CSV
Export batch inference results to CSV format:
import csv
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse
def export_batch_to_csv(results, image_paths, output_path="results.csv"):
"""Export batch results to CSV."""
with open(output_path, "w", newline="") as f:
writer = csv.writer(f)
writer.writerow(["Image", "Text", "Object Count", "Objects"])
for img_path, (result, error) in zip(image_paths, results):
if error is None:
# Extract text based on response type
if isinstance(result, VQAResponse):
text = result.result.answer
obj_count = 0
objects = []
elif isinstance(result, PhraseGroundingResponse):
text = result.result.sentence
obj_count = len(result.result.groundings)
objects = [g.phrase for g in result.result.groundings]
else:
text = result.result
obj_count = 0
objects = []
writer.writerow([
img_path,
text,
obj_count,
"; ".join(objects)
])
print(f"Exported to {output_path}")
# Usage
results = model(source="./images/")
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
export_batch_to_csv(results, image_paths, "results.csv")
Common workflows
Dataset annotation workflow
Generate annotations for unlabeled images using inference:
import json
from pathlib import Path
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse
def annotate_dataset(model, image_dir, output_file):
"""Generate annotations for unlabeled images."""
results = model(
source=image_dir,
user_prompt="Describe this image concisely",
recursive=True,
show_progress=True
)
annotations = []
for result, error in results:
if error is not None:
continue
annotation = {}
# Extract text based on response type
if isinstance(result, VQAResponse):
annotation["text"] = result.result.answer
annotation["type"] = "vqa"
elif isinstance(result, PhraseGroundingResponse):
annotation["text"] = result.result.sentence
annotation["type"] = "phrase_grounding"
annotation["objects"] = [
{
"phrase": g.phrase,
"bounding_boxes": g.grounding
}
for g in result.result.groundings
]
else:
annotation["text"] = result.result
annotation["type"] = "generic"
annotations.append(annotation)
with open(output_file, 'w') as f:
json.dump(annotations, f, indent=2)
print(f"Generated {len(annotations)} annotations")
# Usage
annotate_dataset(model, "./images", "annotations.json")Quality control workflow
Validate model predictions against expected outputs:
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse
def validate_predictions(model, test_cases):
"""Validate model predictions against expected outputs."""
results = []
for test in test_cases:
result, error = model(
source=test["image"],
user_prompt=test["prompt"]
)
if error is None:
# Extract text based on response type
if isinstance(result, VQAResponse):
prediction_text = result.result.answer
elif isinstance(result, PhraseGroundingResponse):
prediction_text = result.result.sentence
else:
prediction_text = result.result
prediction = prediction_text.lower()
expected = test["expected"].lower()
match = expected in prediction
results.append({
"image": test["image"],
"prediction": prediction_text,
"expected": test["expected"],
"match": match
})
else:
results.append({
"image": test["image"],
"error": str(error),
"match": False
})
# Calculate accuracy
matches = sum(1 for r in results if r.get("match", False))
accuracy = matches / len(results) if results else 0
print(f"Accuracy: {accuracy:.2%}")
return results
# Usage
test_cases = [
{"image": "defect1.jpg", "prompt": "Any defects?", "expected": "defect"},
{"image": "good1.jpg", "prompt": "Any defects?", "expected": "no defect"}
]
validation_results = validate_predictions(model, test_cases)
Related resources
- Prediction schemas — Complete reference for all response types and fields
- Inference overview — Getting started with Vi SDK inference
- Run inference — Execute predictions on single images and batches
- Task types — VQA and phrase grounding explained
- Load models — Load models from Datature Vi or HuggingFace
- Configure generation — Control output parameters for better results
- Optimize performance — Memory management and GPU utilization
- Evaluate a model — Assess model performance with metrics
- Vi SDK getting started — Quick start guide for the SDK