Handle results

Access, process, and visualize inference results including captions and bounding boxes from VLM predictions.

📋

Prerequisites

Learn how to run inference →

Overview

Inference results from the Vi SDK contain:

  • Captions — Text descriptions or answers from VQA tasks
  • Grounded phrases — Detected objects with bounding boxes from phrase grounding
  • Structured data — Organized, predictable format for easy processing

View complete prediction schemas →


Accessing results

Basic result access

Results are returned as (result, error) tuples by default (non-streaming mode):

from vi.inference import ViModel
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

model = ViModel(run_id="your-run-id")

result, error = model(
    source="image.jpg",
    user_prompt="Describe this image"
)

if error is None:
    # Access text based on response type
    if isinstance(result, VQAResponse):
        print(f"Answer: {result.result.answer}")
    elif isinstance(result, PhraseGroundingResponse):
        print(f"Caption: {result.result.sentence}")
else:
    print(f"Error: {error}")

See complete schema reference →

Checking available fields

Always check the response type before accessing fields, since each response type exposes different ones:

from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

result, error = model(source="image.jpg")

if error is None:
    # Check for VQA response
    if isinstance(result, VQAResponse):
        print(f"Answer: {result.result.answer}")

    # Check for Phrase Grounding response
    elif isinstance(result, PhraseGroundingResponse):
        print(f"Caption: {result.result.sentence}")
        print(f"Found {len(result.result.groundings)} objects")
        for grounding in result.result.groundings:
            print(f"  - {grounding.phrase}: {grounding.grounding}")

Learn about all response types →


Working with captions

Basic caption access

Access text responses from visual question answering and phrase grounding tasks:

from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

result, error = model(
    source="image.jpg",
    user_prompt="What's in this image?"
)

if error is None:
    # VQA responses have an answer field
    if isinstance(result, VQAResponse):
        text = result.result.answer
        print(f"Answer: {text}")

    # Phrase Grounding responses have a sentence field
    elif isinstance(result, PhraseGroundingResponse):
        text = result.result.sentence
        print(f"Caption: {text}")

See all available fields →

Caption processing

Process and analyze VQA and phrase grounding responses:

from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

# Get text from a response (`result` comes from a previous model call)
if isinstance(result, VQAResponse):
    text = result.result.answer
elif isinstance(result, PhraseGroundingResponse):
    text = result.result.sentence
else:
    text = result.result  # Generic response

# Convert to lowercase for comparison
text_lower = text.lower()

# Check for keywords
if "car" in text_lower:
    print("Car detected in image")

# Extract information
words = text.split()
word_count = len(words)
print(f"Text length: {word_count} words")

Save captions to file

Export inference results for later analysis:

import json
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

# Helper function to extract text
def get_text(result):
    if isinstance(result, VQAResponse):
        return result.result.answer
    elif isinstance(result, PhraseGroundingResponse):
        return result.result.sentence
    else:
        return result.result

# Save single result
result, error = model(source="image.jpg")
if error is None:
    with open("result.json", "w") as f:
        json.dump({"text": get_text(result)}, f, indent=2)

# Save multiple results
results = model(source="./images/")

texts = []
for result, error in results:
    if error is None:
        texts.append(get_text(result))

with open("outputs.txt", "w") as f:
    f.write("\n".join(texts))
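
If you prefer one record per line for large batches, JSON Lines is a small change. This is a minimal sketch that reuses the get_text() helper above and makes a fresh batch call:

import json

# Minimal JSONL sketch: one JSON object per line
with open("outputs.jsonl", "w") as f:
    for result, error in model(source="./images/"):
        if error is None:
            f.write(json.dumps({"text": get_text(result)}) + "\n")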

Working with grounded phrases

Accessing grounded phrases

Extract object detections from phrase grounding results:

from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

result, error = model(source="image.jpg")

if error is None and isinstance(result, PhraseGroundingResponse):
    for grounding in result.result.groundings:
        print(f"Phrase: {grounding.phrase}")
        print(f"Bounding boxes: {grounding.grounding}")

View complete phrase grounding schema →

Bounding box format

📐

Coordinate system

Bounding boxes from phrase grounding use normalized coordinates in the range [0, 1024]:

  • Format: [x_min, y_min, x_max, y_max]
  • Top-left corner: (0, 0)
  • Bottom-right corner: (1024, 1024)
  • Independent of actual image dimensions

Convert to pixel coordinates for visualization and processing.

# Example bbox: [100, 200, 500, 600]
# x_min=100, y_min=200, x_max=500, y_max=600
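
As a quick sanity check of the conversion arithmetic, here is a worked example with illustrative numbers (a hypothetical 2048×1536 image):

# Normalized bbox [100, 200, 500, 600] on a 2048x1536 image:
#   x_min = 100 / 1024 * 2048 = 200 px
#   y_min = 200 / 1024 * 1536 = 300 px
#   x_max = 500 / 1024 * 2048 = 1000 px
#   y_max = 600 / 1024 * 1536 = 900 px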

Converting to pixel coordinates

Convert normalized bounding boxes to actual pixel coordinates for visualization:

from PIL import Image
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

def bbox_to_pixels(bbox, image_path):
    """Convert normalized bbox [0-1024] to pixel coordinates."""
    image = Image.open(image_path)
    width, height = image.size

    x_min, y_min, x_max, y_max = bbox

    return [
        int(x_min / 1024 * width),
        int(y_min / 1024 * height),
        int(x_max / 1024 * width),
        int(y_max / 1024 * height)
    ]

# Usage
result, error = model(source="image.jpg")
if error is None and isinstance(result, PhraseGroundingResponse):
    for grounding in result.result.groundings:
        # Each grounding can have multiple bounding boxes
        for bbox in grounding.grounding:
            pixel_bbox = bbox_to_pixels(bbox, "image.jpg")
            print(f"{grounding.phrase}: {pixel_bbox}")

Filter by object type

Filter grounded phrases by category or attribute:

from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

result, error = model(source="image.jpg")

if error is None and isinstance(result, PhraseGroundingResponse):
    # Filter for specific objects
    people = [g for g in result.result.groundings if "person" in g.phrase.lower()]
    vehicles = [g for g in result.result.groundings if any(v in g.phrase.lower() for v in ["car", "truck", "vehicle"])]

    print(f"Found {len(people)} people and {len(vehicles)} vehicles")

Visualization

Built-in visualization

The Vi SDK provides a visualize_prediction() utility function that automatically renders predictions with bounding boxes, labels, and captions:

from vi.inference import ViModel
from vi.inference.utils.visualize import visualize_prediction
from pathlib import Path

# Run inference
model = ViModel(run_id="your-run-id")
result, error = model(source="image.jpg")

if error is None:
    # Visualize the prediction
    image = visualize_prediction(
        image_path=Path("image.jpg"),
        prediction=result
    )

    # Display the result
    image.show()

    # Save the visualization
    image.save("prediction_visualization.png")

Automatic visualization features

The built-in visualize_prediction() function automatically handles:

  • Bounding boxes with labeled phrases for phrase grounding
  • Question and answer panels for VQA tasks
  • Coordinate conversion from [0, 1024] to pixel space
  • Text wrapping for long captions and labels
  • Optimal font sizing based on image dimensions

This function only works with PhraseGroundingResponse and VQAResponse prediction types; it does not support GenericResponse.

Supported prediction types:

  • PhraseGroundingResponse — Draws bounding boxes with phrase labels and displays the sentence caption
  • VQAResponse — Creates a side panel showing the question and answer

⚠️

GenericResponse not supported

The visualize_prediction() function will not work with GenericResponse predictions. GenericResponse indicates that the model output could not be parsed into a structured format (PhraseGrounding or VQA), so there's no standardized way to visualize it.

If you need to visualize GenericResponse outputs, you'll need to implement custom visualization logic based on your specific output format.
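
As a starting point, here is a minimal sketch that overlays the raw response text on the image with PIL. It assumes result.result is a plain string, which may not hold for every output format:

from PIL import Image, ImageDraw

def visualize_generic(image_path, result, output_path="generic_output.jpg"):
    """Minimal sketch: overlay raw GenericResponse text on the image."""
    image = Image.open(image_path)
    draw = ImageDraw.Draw(image)

    # Truncate long outputs so the overlay stays readable
    text = str(result.result)[:200]
    draw.text((10, 10), text, fill="white")

    image.save(output_path)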

Custom visualization

To customize visualization styling, layout, colors, or behavior beyond what the built-in utility provides, you can implement your own visualization using PIL/Pillow, OpenCV, or matplotlib:

from PIL import Image, ImageDraw
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

def visualize_result(image_path, result, output_path="output.jpg"):
    """Visualize inference result with bounding boxes."""
    image = Image.open(image_path)
    draw = ImageDraw.Draw(image)
    width, height = image.size

    # Draw grounded phrases
    if isinstance(result, PhraseGroundingResponse):
        for grounding in result.result.groundings:
            # Each grounding can have multiple bounding boxes
            for bbox in grounding.grounding:
                # Convert to pixel coordinates
                x_min = bbox[0] / 1024 * width
                y_min = bbox[1] / 1024 * height
                x_max = bbox[2] / 1024 * width
                y_max = bbox[3] / 1024 * height

                # Draw rectangle
                draw.rectangle(
                    [(x_min, y_min), (x_max, y_max)],
                    outline='red',
                    width=3
                )

                # Draw label
                draw.text(
                    (x_min, y_min - 10),
                    grounding.phrase,
                    fill='red'
                )

        # Add caption at top
        draw.text((10, 10), result.result.sentence[:100], fill='white')

    # Save
    image.save(output_path)
    print(f"Saved visualization to {output_path}")

# Usage
result, error = model(source="image.jpg")
if error is None:
    visualize_result("image.jpg", result, "output.jpg")

Color-coded visualization (custom)

Example of custom visualization with different colors for each object:

from PIL import Image, ImageDraw
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

def visualize_with_colors(image_path, result, output_path="output.jpg"):
    """Visualize with different colors for each object."""
    image = Image.open(image_path)
    draw = ImageDraw.Draw(image)
    width, height = image.size

    if isinstance(result, PhraseGroundingResponse):
        colors = ['red', 'blue', 'green', 'yellow', 'purple', 'orange']

        for i, grounding in enumerate(result.result.groundings):
            color = colors[i % len(colors)]

            # Draw each bounding box for this phrase
            for bbox in grounding.grounding:
                # Convert bbox
                x_min = bbox[0] / 1024 * width
                y_min = bbox[1] / 1024 * height
                x_max = bbox[2] / 1024 * width
                y_max = bbox[3] / 1024 * height

                # Draw with color
                draw.rectangle(
                    [(x_min, y_min), (x_max, y_max)],
                    outline=color,
                    width=3
                )

                draw.text(
                    (x_min, y_min - 10),
                    grounding.phrase,
                    fill=color
                )

    image.save(output_path)

# Usage
result, error = model(source="image.jpg")
if error is None:
    visualize_with_colors("image.jpg", result, "colored_output.jpg")

🎨

Visualization tips

  • Use visualize_prediction() for quick, automatic visualization (works with PhraseGroundingResponse and VQAResponse only)
  • Implement custom visualization with PIL/Pillow, OpenCV, or matplotlib when you need:
    • Custom colors, fonts, or styling
    • Different layouts or arrangements
    • GenericResponse outputs
    • Additional overlays or metadata
  • Add confidence scores or other metadata to labels in custom implementations (see the sketch below)
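
For example, a richer label inside the custom drawing loop above might look like this. The sketch assumes a hypothetical score attribute on each grounding; the schema shown on this page only guarantees phrase and grounding:

# Hypothetical: enrich the label if your groundings carry a score
score = getattr(grounding, "score", None)
label = f"{grounding.phrase} ({score:.0%})" if score is not None else grounding.phrase
draw.text((x_min, y_min - 10), label, fill=color)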

See common workflows for more examples →


Exporting results

Export to JSON

Export inference results to JSON format:

import json
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

def export_to_json(image_path, result, output_path="result.json"):
    """Export result to JSON format."""
    data = {
        "image": image_path,
        "text": None,
        "objects": []
    }

    # Extract text based on response type
    if isinstance(result, VQAResponse):
        data["text"] = result.result.answer
        data["type"] = "vqa"
    elif isinstance(result, PhraseGroundingResponse):
        data["text"] = result.result.sentence
        data["type"] = "phrase_grounding"

        # Add grounded phrases
        for grounding in result.result.groundings:
            data["objects"].append({
                "phrase": grounding.phrase,
                "bounding_boxes": grounding.grounding
            })
    else:
        data["text"] = result.result
        data["type"] = "generic"

    with open(output_path, "w") as f:
        json.dump(data, f, indent=2)

    print(f"Exported to {output_path}")

# Usage
result, error = model(source="image.jpg")
if error is None:
    export_to_json("image.jpg", result, "result.json")

Export to CSV

Export batch inference results to CSV format:

import csv
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

def export_batch_to_csv(results, image_paths, output_path="results.csv"):
    """Export batch results to CSV."""
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Image", "Text", "Object Count", "Objects"])

        for img_path, (result, error) in zip(image_paths, results):
            if error is None:
                # Extract text based on response type
                if isinstance(result, VQAResponse):
                    text = result.result.answer
                    obj_count = 0
                    objects = []
                elif isinstance(result, PhraseGroundingResponse):
                    text = result.result.sentence
                    obj_count = len(result.result.groundings)
                    objects = [g.phrase for g in result.result.groundings]
                else:
                    text = result.result
                    obj_count = 0
                    objects = []

                writer.writerow([
                    img_path,
                    text,
                    obj_count,
                    "; ".join(objects)
                ])

    print(f"Exported to {output_path}")

# Usage
results = model(source="./images/")
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]  # must match the order results are yielded
export_batch_to_csv(results, image_paths, "results.csv")

Common workflows

Dataset annotation workflow

Generate annotations for unlabeled images using inference:

import json
from pathlib import Path
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

def annotate_dataset(model, image_dir, output_file):
    """Generate annotations for unlabeled images."""
    results = model(
        source=image_dir,
        user_prompt="Describe this image concisely",
        recursive=True,
        show_progress=True
    )

    annotations = []
    for result, error in results:
        if error is not None:
            continue

        annotation = {}

        # Extract text based on response type
        if isinstance(result, VQAResponse):
            annotation["text"] = result.result.answer
            annotation["type"] = "vqa"
        elif isinstance(result, PhraseGroundingResponse):
            annotation["text"] = result.result.sentence
            annotation["type"] = "phrase_grounding"
            annotation["objects"] = [
                {
                    "phrase": g.phrase,
                    "bounding_boxes": g.grounding
                }
                for g in result.result.groundings
            ]
        else:
            annotation["text"] = result.result
            annotation["type"] = "generic"

        annotations.append(annotation)

    with open(output_file, 'w') as f:
        json.dump(annotations, f, indent=2)

    print(f"Generated {len(annotations)} annotations")

# Usage
annotate_dataset(model, "./images", "annotations.json")

Quality control workflow

Validate model predictions against expected outputs:

from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

def validate_predictions(model, test_cases):
    """Validate model predictions against expected outputs."""
    results = []

    for test in test_cases:
        result, error = model(
            source=test["image"],
            user_prompt=test["prompt"]
        )

        if error is None:
            # Extract text based on response type
            if isinstance(result, VQAResponse):
                prediction_text = result.result.answer
            elif isinstance(result, PhraseGroundingResponse):
                prediction_text = result.result.sentence
            else:
                prediction_text = result.result

            prediction = prediction_text.lower()
            expected = test["expected"].lower()
            match = expected in prediction

            results.append({
                "image": test["image"],
                "prediction": prediction_text,
                "expected": test["expected"],
                "match": match
            })
        else:
            results.append({
                "image": test["image"],
                "error": str(error),
                "match": False
            })

    # Calculate accuracy
    matches = sum(1 for r in results if r.get("match", False))
    accuracy = matches / len(results) if results else 0

    print(f"Accuracy: {accuracy:.2%}")
    return results

# Usage
test_cases = [
    {"image": "defect1.jpg", "prompt": "Any defects?", "expected": "defect"},
    {"image": "good1.jpg", "prompt": "Any defects?", "expected": "no defect"}
]
validation_results = validate_predictions(model, test_cases)

Related resources