Task Types

Vision-language models support two task types: Visual Question Answering (VQA) and Phrase Grounding.

Overview

Task types define how the model processes images and generates responses:

  • Visual Question Answering (VQA) - Answer questions about image content
  • Phrase Grounding - Detect and locate objects in images

ℹ️ Task Type Selection

The model automatically determines the appropriate task type based on your prompt and training configuration. You don't need to explicitly specify the task type.

📋 Response Schemas

Each task type returns a different response structure with specific fields. For complete details on all available fields and how to access them:

View complete prediction schemas →


Visual Question Answering (VQA)

VQA enables models to answer natural language questions about images.

What is VQA?

Visual Question Answering requires:

  • Input: An image and a question (user prompt)
  • Output: A natural language answer

The model analyzes the image and generates a contextual answer to your question.

Basic VQA Example

from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

# Ask a question about the image
result, error = model(
    source="image.jpg",
    user_prompt="What color is the car in this image?"  # Question required
)

if error is None:
    print(f"Answer: {result.caption}")

VQA Use Cases

Object Counting

result, error = model(
    source="crowd.jpg",
    user_prompt="How many people are visible in this image?"
)

Attribute Identification

result, error = model(
    source="product.jpg",
    user_prompt="What is the brand name on this product?"
)

Scene Understanding

result, error = model(
    source="scene.jpg",
    user_prompt="What is the main activity happening in this scene?"
)

Defect Detection

result, error = model(
    source="manufactured_part.jpg",
    user_prompt="Are there any visible defects or damage?"
)

Spatial Relationships

result, error = model(
    source="room.jpg",
    user_prompt="What is the position of the table relative to the window?"
)

VQA Prompt Guidelines

✅ Good VQA Prompts:

  • Clear and specific questions
  • Focus on observable visual elements
  • Use question words (What, Where, How many, etc.)
# Specific and clear
"What color is the car?"
"How many windows are visible?"
"Where is the person standing?"
"What type of building is this?"

❌ Avoid:

  • Vague or ambiguous prompts
  • Questions requiring external knowledge
  • Multiple questions in one prompt
# Too vague
"Tell me about this"

# Requires external knowledge
"Who is the person in this image?"

# Multiple questions (split into separate calls; see Multi-Question Analysis below)
"What color is the car and how many doors does it have?"

VQA Response Structure

The VQA response includes:

result, error = model(source="image.jpg", user_prompt="What's in this image?")

if error is None:
    # Access the answer
    answer = result.caption
    print(f"Answer: {answer}")

Phrase Grounding

Phrase Grounding detects and locates objects in images with bounding boxes.

What is Phrase Grounding?

Phrase Grounding provides:

  • Input: An image and an optional user prompt
  • Output: Detected phrases with bounding box coordinates

The model identifies objects and their locations in normalized coordinates.

Basic Phrase Grounding Example

from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

# Phrase grounding with custom prompt
result, error = model(
    source="image.jpg",
    user_prompt="Identify and locate all objects"  # Optional
)

# Or without custom prompt (uses default)
result, error = model(
    source="image.jpg"
)

# grounded_phrases is only present when the model ran Phrase Grounding
if error is None and hasattr(result, 'grounded_phrases'):
    for phrase in result.grounded_phrases:
        print(f"Phrase: {phrase.phrase}")
        print(f"BBox: {phrase.bbox}")

Phrase Grounding Use Cases

Object Detection

# Detect all objects (default prompt)
result, error = model(
    source="scene.jpg"
)

if error is None and hasattr(result, 'grounded_phrases'):
    print(f"Found {len(result.grounded_phrases)} objects:")
    for phrase in result.grounded_phrases:
        print(f"  - {phrase.phrase} at {phrase.bbox}")

Specific Object Location

# Find specific objects
result, error = model(
    source="image.jpg",
    user_prompt="Locate all people and vehicles"
)

Safety Inspection

# Identify safety equipment
result, error = model(
    source="worksite.jpg",
    user_prompt="Identify and locate all safety equipment and protective gear"
)

Quality Control

# Detect defects with locations
result, error = model(
    source="product.jpg",
    user_prompt="Locate any defects, scratches, or imperfections"
)

Phrase Grounding Response Structure

The response includes detected phrases with bounding boxes:

result, error = model(source="image.jpg")

if error is None:
    # Caption/description
    print(f"Caption: {result.caption}")

    # Grounded phrases (if available)
    if hasattr(result, 'grounded_phrases'):
        for phrase in result.grounded_phrases:
            print(f"\nPhrase: {phrase.phrase}")
            print(f"BBox (normalized [0-1024]): {phrase.bbox}")
            # bbox format: [x_min, y_min, x_max, y_max]

Bounding Box Coordinates

Bounding boxes are in normalized coordinates [0, 1024]:

  • Format: [x_min, y_min, x_max, y_max]
  • Coordinate range: 0 to 1024
  • Top-left corner: (0, 0)
  • Bottom-right corner: (1024, 1024)
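
For example, on a 1920×1080 image, the normalized box [256, 128, 512, 640] maps to pixel coordinates [480, 135, 960, 675].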

Convert to pixel coordinates:

from PIL import Image

def bbox_to_pixels(bbox, image_path):
    """Convert a normalized [0, 1024] bbox to pixel coordinates."""
    with Image.open(image_path) as image:
        width, height = image.size

    x_min, y_min, x_max, y_max = bbox

    return [
        int(x_min / 1024 * width),
        int(y_min / 1024 * height),
        int(x_max / 1024 * width),
        int(y_max / 1024 * height)
    ]

# Usage
result, error = model(source="image.jpg")
if error is None and hasattr(result, 'grounded_phrases'):
    for phrase in result.grounded_phrases:
        pixel_bbox = bbox_to_pixels(phrase.bbox, "image.jpg")
        print(f"{phrase.phrase}: {pixel_bbox}")
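
If you prefer to render boxes yourself rather than use the built-in utility described below, here is a minimal sketch using Pillow's ImageDraw. The helper name draw_grounded_phrases is ours for illustration; it assumes the phrase.phrase and phrase.bbox fields shown above.

from PIL import Image, ImageDraw

def draw_grounded_phrases(image_path, grounded_phrases, out_path="annotated.jpg"):
    """Draw each phrase's normalized [0, 1024] bbox onto the image."""
    with Image.open(image_path) as image:
        image = image.convert("RGB")
        draw = ImageDraw.Draw(image)
        width, height = image.size

        for phrase in grounded_phrases:
            x_min, y_min, x_max, y_max = phrase.bbox
            box = [
                x_min / 1024 * width,
                y_min / 1024 * height,
                x_max / 1024 * width,
                y_max / 1024 * height,
            ]
            draw.rectangle(box, outline="red", width=3)
            # Place the label just above the box, clamped to the image top
            draw.text((box[0], max(box[1] - 12, 0)), phrase.phrase, fill="red")

        image.save(out_path)

# Usage
result, error = model(source="image.jpg")
if error is None and hasattr(result, 'grounded_phrases'):
    draw_grounded_phrases("image.jpg", result.grounded_phrases)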

🎨 Visualize predictions easily

Use the built-in visualize_prediction() utility for automatic bounding box rendering with coordinate conversion, text wrapping, and optimal font sizing.

Learn more about result handling →

Phrase Grounding Prompt Guidelines

✅ Good Phrase Grounding Prompts:

  • Specific object categories
  • Clear detection targets
  • Optional (can omit for default behavior)
# Specific categories
"Locate all people and vehicles"
"Find all safety equipment"
"Identify furniture items"

# Detection focus
"Detect defects and damage"
"Locate text and labels"

# Default (no prompt)
result, error = model(source="image.jpg")

❌ Avoid:

  • Questions (use VQA instead)
  • Counting requests
  • Attribute queries
# Wrong task type (use VQA)
"How many cars are there?"  # Use VQA
"What color is the car?"     # Use VQA

Combining Task Types

You can use both task types in your workflow:

Sequential Analysis

# First: Phrase Grounding to detect objects
grounding_result, error = model(
    source="image.jpg",
    user_prompt="Locate all defects"
)

if error is None and hasattr(grounding_result, 'grounded_phrases'):
    print(f"Found {len(grounding_result.grounded_phrases)} defects")

    # Then: VQA for detailed analysis
    vqa_result, error = model(
        source="image.jpg",
        user_prompt="What type of defects are present?"
    )

    if error is None:
        print(f"Analysis: {vqa_result.caption}")

Verification Workflow

# 1. Detect objects
result, error = model(source="image.jpg")

if error is None and hasattr(result, 'grounded_phrases'):
    # 2. Verify specific objects
    for phrase in result.grounded_phrases:
        if "person" in phrase.phrase.lower():
            # Ask a follow-up question about this detection
            # (note: phrase.bbox is in normalized [0, 1024] coordinates)
            detail, err = model(
                source="image.jpg",
                user_prompt=f"What is the person at {phrase.bbox} doing?"
            )

Task Type Comparison

Aspect          | VQA                      | Phrase Grounding
----------------|--------------------------|--------------------------------------------------
User Prompt     | Required (question)      | Optional
Output Format   | Natural language answer  | Caption + bounding boxes
Primary Use     | Understanding, Q&A       | Detection, localization
Example Prompt  | "What color is the car?" | "Locate all vehicles"
Response Fields | result.result.answer     | result.result.sentence, result.result.groundings
Bounding Boxes  | No                       | Yes

Best Practices

1. Choose the Right Task Type

Use VQA for:

  • Answering specific questions
  • Counting objects
  • Identifying attributes
  • Understanding relationships
  • Classification

Use Phrase Grounding for:

  • Detecting object locations
  • Spatial analysis
  • Quality inspection
  • Safety compliance
  • Inventory tracking

2. Craft Effective Prompts

For VQA:

# ✅ Good - specific question
"What color is the car?"

# ❌ Bad - too vague
"Tell me about this"

For Phrase Grounding:

# ✅ Good - clear detection target
"Locate all safety equipment"

# ❌ Bad - asking a question (use VQA)
"How many helmets are there?"

3. Handle Optional Prompts

For Phrase Grounding, omitting the prompt uses the default:

# With prompt - specific detection
result, error = model(
    source="image.jpg",
    user_prompt="Locate people and vehicles"
)

# Without prompt - general detection
result, error = model(
    source="image.jpg"
)

4. Validate Response Structure

Always check response type to access the correct fields:

from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

result, error = model(source="image.jpg")

if error is None:
    # Check for VQA response
    if isinstance(result, VQAResponse):
        print(f"Answer: {result.result.answer}")

    # Check for Phrase Grounding response
    elif isinstance(result, PhraseGroundingResponse):
        print(f"Caption: {result.result.sentence}")
        print(f"Found {len(result.result.groundings)} objects")
        for grounding in result.result.groundings:
            print(f"  - {grounding.phrase}")

See complete response schemas →


Common Patterns

Multi-Question Analysis

questions = [
    "What is the main subject?",
    "What is the background setting?",
    "What time of day is it?",
    "Are there any people visible?"
]

for question in questions:
    result, error = model(
        source="image.jpg",
        user_prompt=question
    )
    if error is None:
        print(f"Q: {question}")
        print(f"A: {result.caption}\n")

Object-by-Object Analysis

# First detect objects
result, error = model(source="image.jpg")

if error is None and hasattr(result, 'grounded_phrases'):
    # Analyze each detected object
    for phrase in result.grounded_phrases:
        analysis, err = model(
            source="image.jpg",
            user_prompt=f"Describe the {phrase.phrase} in detail"
        )
        if err is None:
            print(f"{phrase.phrase}: {analysis.caption}")

Conditional Analysis

# Check for specific objects first
result, error = model(
    source="image.jpg",
    user_prompt="Are there any people in this image?"
)

if error is None and "yes" in result.caption.lower():
    # If people found, locate them
    location, err = model(
        source="image.jpg",
        user_prompt="Locate all people"
    )

See also