Task Types

Vision-language models support two task types: Visual Question Answering (VQA) and Phrase Grounding.

Overview

Task types define how the model processes images and generates responses:

  • Visual Question Answering (VQA) - Answer questions about image content
  • Phrase Grounding - Detect and locate objects in images

ℹ️ Task Type Selection

The model automatically determines the appropriate task type based on your prompt and training configuration. You don't need to explicitly specify the task type.

📋 Response Schemas

Each task type returns a different response structure with specific fields. For complete details on all available fields and how to access them:

View complete prediction schemas →


Visual Question Answering (VQA)

VQA enables models to answer natural language questions about images.

What is VQA?

Visual Question Answering requires:

  • Input: An image and a question (user prompt)
  • Output: A natural language answer

The model analyzes the image and generates a contextual answer to your question.

Basic VQA Example

from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

# Ask a question about the image
result, error = model(
    source="image.jpg",
    user_prompt="What color is the car in this image?"  # Question required
)

if error is None:
    print(f"Answer: {result.caption}")

VQA Use Cases

Object Counting

result, error = model(
    source="crowd.jpg",
    user_prompt="How many people are visible in this image?"
)

Attribute Identification

result, error = model(
    source="product.jpg",
    user_prompt="What is the brand name on this product?"
)

Scene Understanding

result, error = model(
    source="scene.jpg",
    user_prompt="What is the main activity happening in this scene?"
)

Defect Detection

result, error = model(
    source="manufactured_part.jpg",
    user_prompt="Are there any visible defects or damage?"
)

Spatial Relationships

result, error = model(
    source="room.jpg",
    user_prompt="What is the position of the table relative to the window?"
)

VQA Prompt Guidelines

✅ Good VQA Prompts:

  • Clear and specific questions
  • Focus on observable visual elements
  • Use question words (What, Where, How many, etc.)
# Specific and clear
"What color is the car?"
"How many windows are visible?"
"Where is the person standing?"
"What type of building is this?"

❌ Avoid:

  • Vague or ambiguous prompts
  • Questions requiring external knowledge
  • Multiple questions in one prompt
# Too vague
"Tell me about this"

# Requires external knowledge
"Who is the person in this image?"

# Multiple questions (split into separate calls; see Multi-Question Analysis below)
"What color is the car and how many doors does it have?"

VQA Response Structure

The VQA response includes:

result, error = model(source="image.jpg", user_prompt="What's in this image?")

if error is None:
    # Access the answer
    answer = result.caption
    print(f"Answer: {answer}")

Phrase Grounding

Phrase Grounding detects and locates objects in images with bounding boxes.

What is Phrase Grounding?

Phrase Grounding provides:

  • Input: An image and an optional user prompt
  • Output: Detected phrases with bounding box coordinates

The model identifies objects and their locations in normalized coordinates.

Basic Phrase Grounding Example

from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

# Phrase grounding with custom prompt
result, error = model(
    source="image.jpg",
    user_prompt="Identify and locate all objects"  # Optional
)

# Or without custom prompt (uses default)
result, error = model(
    source="image.jpg"
)

# grounded_phrases is only present when the model ran Phrase Grounding
if error is None and hasattr(result, 'grounded_phrases'):
    for phrase in result.grounded_phrases:
        print(f"Phrase: {phrase.phrase}")
        print(f"BBox: {phrase.bbox}")

Phrase Grounding Use Cases

Object Detection

# Detect all objects (default prompt)
result, error = model(
    source="scene.jpg"
)

if error is None and hasattr(result, 'grounded_phrases'):
    print(f"Found {len(result.grounded_phrases)} objects:")
    for phrase in result.grounded_phrases:
        print(f"  - {phrase.phrase} at {phrase.bbox}")

Specific Object Location

# Find specific objects
result, error = model(
    source="image.jpg",
    user_prompt="Locate all people and vehicles"
)

Safety Inspection

# Identify safety equipment
result, error = model(
    source="worksite.jpg",
    user_prompt="Identify and locate all safety equipment and protective gear"
)

Quality Control

# Detect defects with locations
result, error = model(
    source="product.jpg",
    user_prompt="Locate any defects, scratches, or imperfections"
)

Phrase Grounding Response Structure

The response includes detected phrases with bounding boxes:

result, error = model(source="image.jpg")

if error is None:
    # Caption/description
    print(f"Caption: {result.caption}")

    # Grounded phrases (if available)
    if hasattr(result, 'grounded_phrases'):
        for phrase in result.grounded_phrases:
            print(f"\nPhrase: {phrase.phrase}")
            print(f"BBox (normalized [0-1024]): {phrase.bbox}")
            # bbox format: [x_min, y_min, x_max, y_max]

Bounding Box Coordinates

Bounding boxes are in normalized coordinates [0, 1024]:

  • Format: [x_min, y_min, x_max, y_max]
  • Coordinate range: 0 to 1024
  • Top-left corner: (0, 0)
  • Bottom-right corner: (1024, 1024)
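
For example, on a 1920×1080 image, the normalized box [256, 128, 512, 640] maps to pixel coordinates [480, 135, 960, 675].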

Convert to pixel coordinates:

from PIL import Image

def bbox_to_pixels(bbox, image_path):
    """Convert a normalized [0, 1024] bbox to pixel coordinates."""
    with Image.open(image_path) as image:
        width, height = image.size

    x_min, y_min, x_max, y_max = bbox

    return [
        int(x_min / 1024 * width),
        int(y_min / 1024 * height),
        int(x_max / 1024 * width),
        int(y_max / 1024 * height)
    ]

# Usage
result, error = model(source="image.jpg")
if error is None and hasattr(result, 'grounded_phrases'):
    for phrase in result.grounded_phrases:
        pixel_bbox = bbox_to_pixels(phrase.bbox, "image.jpg")
        print(f"{phrase.phrase}: {pixel_bbox}")
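
If you prefer to render boxes yourself rather than use the built-in utility described below, here is a minimal sketch using Pillow's ImageDraw. The helper name draw_grounded_phrases is ours for illustration; it assumes the phrase.phrase and phrase.bbox fields shown above.

from PIL import Image, ImageDraw

def draw_grounded_phrases(image_path, grounded_phrases, out_path="annotated.jpg"):
    """Draw each phrase's normalized [0, 1024] bbox onto the image."""
    with Image.open(image_path) as image:
        image = image.convert("RGB")
        draw = ImageDraw.Draw(image)
        width, height = image.size

        for phrase in grounded_phrases:
            x_min, y_min, x_max, y_max = phrase.bbox
            box = [
                x_min / 1024 * width,
                y_min / 1024 * height,
                x_max / 1024 * width,
                y_max / 1024 * height,
            ]
            draw.rectangle(box, outline="red", width=3)
            # Place the label just above the box, clamped to the image top
            draw.text((box[0], max(box[1] - 12, 0)), phrase.phrase, fill="red")

        image.save(out_path)

# Usage
result, error = model(source="image.jpg")
if error is None and hasattr(result, 'grounded_phrases'):
    draw_grounded_phrases("image.jpg", result.grounded_phrases)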

🎨 Visualize predictions easily

Use the built-in visualize_prediction() utility for automatic bounding box rendering with coordinate conversion, text wrapping, and optimal font sizing.

Learn more about result handling →

Phrase Grounding Prompt Guidelines

✅ Good Phrase Grounding Prompts:

  • Specific object categories
  • Clear detection targets
  • Optional (can omit for default behavior)
# Specific categories
"Locate all people and vehicles"
"Find all safety equipment"
"Identify furniture items"

# Detection focus
"Detect defects and damage"
"Locate text and labels"

# Default (no prompt)
result, error = model(source="image.jpg")

❌ Avoid:

  • Questions (use VQA instead)
  • Counting requests
  • Attribute queries
# Wrong task type (use VQA)
"How many cars are there?"  # Use VQA
"What color is the car?"     # Use VQA

Combining Task Types

You can use both task types in your workflow:

Sequential Analysis

# First: Phrase Grounding to detect objects
grounding_result, error = model(
    source="image.jpg",
    user_prompt="Locate all defects"
)

if error is None and hasattr(grounding_result, 'grounded_phrases'):
    print(f"Found {len(grounding_result.grounded_phrases)} defects")

    # Then: VQA for detailed analysis
    vqa_result, error = model(
        source="image.jpg",
        user_prompt="What type of defects are present?"
    )

    if error is None:
        print(f"Analysis: {vqa_result.caption}")

Verification Workflow

# 1. Detect objects
result, error = model(source="image.jpg")

if error is None and hasattr(result, 'grounded_phrases'):
    # 2. Verify specific objects
    for phrase in result.grounded_phrases:
        if "person" in phrase.phrase.lower():
            # Ask a follow-up question about this detection
            # (note: phrase.bbox is in normalized [0, 1024] coordinates)
            detail, err = model(
                source="image.jpg",
                user_prompt=f"What is the person at {phrase.bbox} doing?"
            )

Task Type Comparison

Aspect          | VQA                      | Phrase Grounding
----------------|--------------------------|--------------------------------------------------
User Prompt     | Required (question)      | Optional
Output Format   | Natural language answer  | Caption + bounding boxes
Primary Use     | Understanding, Q&A       | Detection, localization
Example Prompt  | "What color is the car?" | "Locate all vehicles"
Response Fields | result.result.answer     | result.result.sentence, result.result.groundings
Bounding Boxes  | No                       | Yes

Best Practices

1. Choose the Right Task Type

Use VQA for:

  • Answering specific questions
  • Counting objects
  • Identifying attributes
  • Understanding relationships
  • Classification

Use Phrase Grounding for:

  • Detecting object locations
  • Spatial analysis
  • Quality inspection
  • Safety compliance
  • Inventory tracking

2. Craft Effective Prompts

For VQA:

# ✅ Good - specific question
"What color is the car?"

# ❌ Bad - too vague
"Tell me about this"

For Phrase Grounding:

# ✅ Good - clear detection target
"Locate all safety equipment"

# ❌ Bad - asking a question (use VQA)
"How many helmets are there?"

3. Handle Optional Prompts

For Phrase Grounding, omitting the prompt uses the default:

# With prompt - specific detection
result, error = model(
    source="image.jpg",
    user_prompt="Locate people and vehicles"
)

# Without prompt - general detection
result, error = model(
    source="image.jpg"
)

4. Validate Response Structure

Always check response type to access the correct fields:

from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

result, error = model(source="image.jpg")

if error is None:
    # Check for VQA response
    if isinstance(result, VQAResponse):
        print(f"Answer: {result.result.answer}")

    # Check for Phrase Grounding response
    elif isinstance(result, PhraseGroundingResponse):
        print(f"Caption: {result.result.sentence}")
        print(f"Found {len(result.result.groundings)} objects")
        for grounding in result.result.groundings:
            print(f"  - {grounding.phrase}")

See complete response schemas →


Common Patterns

Multi-Question Analysis

questions = [
    "What is the main subject?",
    "What is the background setting?",
    "What time of day is it?",
    "Are there any people visible?"
]

for question in questions:
    result, error = model(
        source="image.jpg",
        user_prompt=question
    )
    if error is None:
        print(f"Q: {question}")
        print(f"A: {result.caption}\n")

Object-by-Object Analysis

# First detect objects
result, error = model(source="image.jpg")

if error is None and hasattr(result, 'grounded_phrases'):
    # Analyze each detected object
    for phrase in result.grounded_phrases:
        analysis, err = model(
            source="image.jpg",
            user_prompt=f"Describe the {phrase.phrase} in detail"
        )
        if err is None:
            print(f"{phrase.phrase}: {analysis.caption}")

Conditional Analysis

# Check for specific objects first
result, error = model(
    source="image.jpg",
    user_prompt="Are there any people in this image?"
)

if error is None and "yes" in result.caption.lower():
    # If people found, locate them
    location, err = model(
        source="image.jpg",
        user_prompt="Locate all people"
    )

See also