Task Types
The library supports two task types for vision-language models: Visual Question Answering and Phrase Grounding.
Overview
Task types define how the model processes images and generates responses:
- Visual Question Answering (VQA) - Answer questions about image content
- Phrase Grounding - Detect and locate objects in images
Task Type Selection
The model automatically determines the appropriate task type based on your prompt and training configuration. You don't need to explicitly specify the task type.
Response Schemas
Each task type returns a different response structure with specific fields. For complete details on all available fields and how to access them, see the Prediction Schemas reference.
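Every call in this guide returns a (result, error) tuple; check the error before touching the result. A minimal sketch of that pattern, assuming the error object carries a printable message:

```python
from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

result, error = model(source="image.jpg", user_prompt="What is shown here?")
if error is not None:
    # Handle the failure before using the result
    print(f"Inference failed: {error}")  # assumes the error is printable
else:
    print(result.caption)
```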
Visual Question Answering (VQA)
VQA enables models to answer natural language questions about images.
What is VQA?
Visual Question Answering requires:
- Input: An image and a question (user prompt)
- Output: A natural language answer
The model analyzes the image and generates a contextual answer to your question.
Basic VQA Example
```python
from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

# Ask a question about the image
result, error = model(
    source="image.jpg",
    user_prompt="What color is the car in this image?"  # Question required
)

if error is None:
    print(f"Answer: {result.caption}")
```

VQA Use Cases
Object Counting
```python
result, error = model(
    source="crowd.jpg",
    user_prompt="How many people are visible in this image?"
)
```

Attribute Identification
```python
result, error = model(
    source="product.jpg",
    user_prompt="What is the brand name on this product?"
)
```

Scene Understanding
```python
result, error = model(
    source="scene.jpg",
    user_prompt="What is the main activity happening in this scene?"
)
```

Defect Detection
```python
result, error = model(
    source="manufactured_part.jpg",
    user_prompt="Are there any visible defects or damage?"
)
```

Spatial Relationships
```python
result, error = model(
    source="room.jpg",
    user_prompt="What is the position of the table relative to the window?"
)
```

VQA Prompt Guidelines
✅ Good VQA Prompts:
- Clear and specific questions
- Focus on observable visual elements
- Use question words (What, Where, How many, etc.)
```python
# Specific and clear
"What color is the car?"
"How many windows are visible?"
"Where is the person standing?"
"What type of building is this?"
```

❌ Avoid:
- Vague or ambiguous prompts
- Questions requiring external knowledge
- Multiple questions in one prompt
```python
# Too vague
"Tell me about this"

# Requires external knowledge
"Who is the person in this image?"

# Multiple questions (split into separate calls)
"What color is the car and how many doors does it have?"
```

VQA Response Structure
The VQA response includes:
```python
result, error = model(source="image.jpg", user_prompt="What's in this image?")

if error is None:
    # Access the answer
    answer = result.caption
    print(f"Answer: {answer}")
```

Phrase Grounding
Phrase Grounding detects objects in images and localizes them with bounding boxes.
What is Phrase Grounding?
Phrase Grounding provides:
- Input: An image and an optional user prompt
- Output: Detected phrases with bounding box coordinates
The model identifies objects and their locations in normalized coordinates.
Basic Phrase Grounding Example
```python
from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

# Phrase grounding with custom prompt
result, error = model(
    source="image.jpg",
    user_prompt="Identify and locate all objects"  # Optional
)

# Or without custom prompt (uses default)
result, error = model(
    source="image.jpg"
)

if error is None and hasattr(result, 'grounded_phrases'):
    for phrase in result.grounded_phrases:
        print(f"Phrase: {phrase.phrase}")
        print(f"BBox: {phrase.bbox}")
```

Phrase Grounding Use Cases
Object Detection
```python
# Detect all objects (default prompt)
result, error = model(
    source="scene.jpg"
)

if error is None and hasattr(result, 'grounded_phrases'):
    print(f"Found {len(result.grounded_phrases)} objects:")
    for phrase in result.grounded_phrases:
        print(f"  - {phrase.phrase} at {phrase.bbox}")
```

Specific Object Location
```python
# Find specific objects
result, error = model(
    source="image.jpg",
    user_prompt="Locate all people and vehicles"
)
```

Safety Inspection
```python
# Identify safety equipment
result, error = model(
    source="worksite.jpg",
    user_prompt="Identify and locate all safety equipment and protective gear"
)
```

Quality Control
```python
# Detect defects with locations
result, error = model(
    source="product.jpg",
    user_prompt="Locate any defects, scratches, or imperfections"
)
```

Phrase Grounding Response Structure
The response includes detected phrases with bounding boxes:
```python
result, error = model(source="image.jpg")

if error is None:
    # Caption/description
    print(f"Caption: {result.caption}")

    # Grounded phrases (if available)
    if hasattr(result, 'grounded_phrases'):
        for phrase in result.grounded_phrases:
            print(f"\nPhrase: {phrase.phrase}")
            print(f"BBox (normalized [0-1024]): {phrase.bbox}")
            # bbox format: [x_min, y_min, x_max, y_max]
```

Bounding Box Coordinates
Bounding boxes are in normalized coordinates [0, 1024]:
- Format: `[x_min, y_min, x_max, y_max]`
- Coordinate range: `0` to `1024`
- Top-left corner: `(0, 0)`
- Bottom-right corner: `(1024, 1024)`
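For example, on a 1920×1080 image, the normalized bbox `[256, 128, 768, 896]` maps to the pixel bbox `[480, 135, 1440, 945]`, since each x coordinate is scaled by width / 1024 and each y coordinate by height / 1024.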
Convert to pixel coordinates:
```python
from PIL import Image

def bbox_to_pixels(bbox, image_path):
    """Convert normalized bbox to pixel coordinates."""
    image = Image.open(image_path)
    width, height = image.size

    x_min, y_min, x_max, y_max = bbox
    return [
        int(x_min / 1024 * width),
        int(y_min / 1024 * height),
        int(x_max / 1024 * width),
        int(y_max / 1024 * height)
    ]

# Usage
result, error = model(source="image.jpg")

if error is None and hasattr(result, 'grounded_phrases'):
    for phrase in result.grounded_phrases:
        pixel_bbox = bbox_to_pixels(phrase.bbox, "image.jpg")
        print(f"{phrase.phrase}: {pixel_bbox}")
```
Visualize predictions easily
Use the built-in visualize_prediction() utility for automatic bounding box rendering with coordinate conversion, text wrapping, and optimal font sizing.
Learn more about result handling →
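A sketch of what a call might look like; the import path and parameters below are assumptions, not the confirmed API, so check the result-handling guide for the actual signature:

```python
# Hypothetical usage - the import path and parameters are assumptions;
# see the result-handling guide for the real signature.
from vi.inference import visualize_prediction  # assumed import location

result, error = model(source="image.jpg")
if error is None:
    visualize_prediction(result, source="image.jpg")  # assumed parameters
```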
Phrase Grounding Prompt Guidelines
✅ Good Phrase Grounding Prompts:
- Specific object categories
- Clear detection targets
- Optional (can omit for default behavior)
```python
# Specific categories
"Locate all people and vehicles"
"Find all safety equipment"
"Identify furniture items"

# Detection focus
"Detect defects and damage"
"Locate text and labels"

# Default (no prompt)
result, error = model(source="image.jpg")
```

❌ Avoid:
- Questions (use VQA instead)
- Counting requests
- Attribute queries
```python
# Wrong task type (use VQA)
"How many cars are there?"  # Use VQA
"What color is the car?"    # Use VQA
```

Combining Task Types
You can use both task types in your workflow:
Sequential Analysis
```python
# First: Phrase Grounding to detect objects
grounding_result, error = model(
    source="image.jpg",
    user_prompt="Locate all defects"
)

if error is None and hasattr(grounding_result, 'grounded_phrases'):
    print(f"Found {len(grounding_result.grounded_phrases)} defects")

# Then: VQA for detailed analysis
vqa_result, error = model(
    source="image.jpg",
    user_prompt="What type of defects are present?"
)

if error is None:
    print(f"Analysis: {vqa_result.caption}")
```

Verification Workflow
```python
# 1. Detect objects
result, error = model(source="image.jpg")

if error is None and hasattr(result, 'grounded_phrases'):
    # 2. Verify specific objects
    for phrase in result.grounded_phrases:
        if "person" in phrase.phrase.lower():
            # Ask follow-up questions
            detail, err = model(
                source="image.jpg",
                user_prompt=f"What is the person at {phrase.bbox} doing?"
            )
```

Task Type Comparison
| Aspect | VQA | Phrase Grounding |
|---|---|---|
| User Prompt | Required (question) | Optional |
| Output Format | Natural language answer | Caption + Bounding boxes |
| Primary Use | Understanding, Q&A | Detection, Localization |
| Example Prompt | "What color is the car?" | "Locate all vehicles" |
| Response Field | `result.result.answer` | `result.result.sentence`, `result.result.groundings` |
| Bounding Boxes | No | Yes |
Best Practices
1. Choose the Right Task Type
Use VQA for:
- Answering specific questions
- Counting objects
- Identifying attributes
- Understanding relationships
- Classification
Use Phrase Grounding for:
- Detecting object locations
- Spatial analysis
- Quality inspection
- Safety compliance
- Inventory tracking (see the sketch below)
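For example, inventory-style tracking can be approximated by tallying detected phrases. A minimal sketch using the documented grounded_phrases field (the warehouse image and prompt are illustrative):

```python
from collections import Counter

# Tally detected objects by phrase (image path and prompt are illustrative)
result, error = model(
    source="warehouse.jpg",
    user_prompt="Locate all boxes and pallets"
)

if error is None and hasattr(result, 'grounded_phrases'):
    counts = Counter(p.phrase.lower() for p in result.grounded_phrases)
    for item, count in counts.items():
        print(f"{item}: {count}")
```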
2. Craft Effective Prompts
For VQA:
```python
# ✅ Good - specific question
"What color is the car?"

# ❌ Bad - too vague
"Tell me about this"
```

For Phrase Grounding:
```python
# ✅ Good - clear detection target
"Locate all safety equipment"

# ❌ Bad - asking a question (use VQA)
"How many helmets are there?"
```

3. Handle Optional Prompts
For Phrase Grounding, omitting the prompt uses the default:
```python
# With prompt - specific detection
result, error = model(
    source="image.jpg",
    user_prompt="Locate people and vehicles"
)

# Without prompt - general detection
result, error = model(
    source="image.jpg"
)
```

4. Validate Response Structure
Always check the response type to access the correct fields:
```python
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse

result, error = model(source="image.jpg")

if error is None:
    # Check for VQA response
    if isinstance(result, VQAResponse):
        print(f"Answer: {result.result.answer}")

    # Check for Phrase Grounding response
    elif isinstance(result, PhraseGroundingResponse):
        print(f"Caption: {result.result.sentence}")
        print(f"Found {len(result.result.groundings)} objects")
        for grounding in result.result.groundings:
            print(f"  - {grounding.phrase}")
```

See complete response schemas →
Common Patterns
Multi-Question Analysis
```python
questions = [
    "What is the main subject?",
    "What is the background setting?",
    "What time of day is it?",
    "Are there any people visible?"
]

for question in questions:
    result, error = model(
        source="image.jpg",
        user_prompt=question
    )

    if error is None:
        print(f"Q: {question}")
        print(f"A: {result.caption}\n")
```

Object-by-Object Analysis
```python
# First detect objects
result, error = model(source="image.jpg")

if error is None and hasattr(result, 'grounded_phrases'):
    # Analyze each detected object
    for phrase in result.grounded_phrases:
        analysis, err = model(
            source="image.jpg",
            user_prompt=f"Describe the {phrase.phrase} in detail"
        )

        if err is None:
            print(f"{phrase.phrase}: {analysis.caption}")
```

Conditional Analysis
```python
# Check for specific objects first
result, error = model(
    source="image.jpg",
    user_prompt="Are there any people in this image?"
)

if error is None and "yes" in result.caption.lower():
    # If people found, locate them
    location, err = model(
        source="image.jpg",
        user_prompt="Locate all people"
    )
```

See also
- Inference Overview — Getting started with inference
- Running Inference — Execute predictions on images
- Prediction Schemas — Complete reference for response types and fields
- Generation Config — Control output parameters
- Result Handling — Process and visualize results
- Phrase Grounding Concepts — Understanding phrase grounding
- VQA Concepts — Understanding VQA