Task Types
Datature Vi supports three inference task types: visual question answering (VQA), phrase grounding, and freeform text. The model determines which applies based on your prompt and training configuration; you don't set it explicitly.
Each task type returns a different response structure. See prediction schemas for the complete field reference.
Visual question answering (VQA)
VQA answers natural language questions about an image. The model analyzes the image and generates a contextual text response to your question.
Input: image + question
Output: natural language answer (result.result.answer)
Basic example
```python
from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

result, error = model(
    source="image.jpg",
    user_prompt="What color is the car in this image?"
)

if error is None:
    print(f"Answer: {result.result.answer}")
```

VQA use cases
```python
# Counting
result, error = model(
    source="crowd.jpg",
    user_prompt="How many people are visible in this image?"
)

# Reading text
result, error = model(
    source="product.jpg",
    user_prompt="What is the brand name on this product?"
)

# Defect inspection
result, error = model(
    source="manufactured_part.jpg",
    user_prompt="Are there any visible defects or damage?"
)

# Scene understanding
result, error = model(
    source="scene.jpg",
    user_prompt="What is the main activity happening in this scene?"
)

# Spatial reasoning
result, error = model(
    source="room.jpg",
    user_prompt="What is the position of the table relative to the window?"
)
```

VQA prompt guidelines
Good prompts are specific, focus on observable elements, and use question words (what, where, how many):
"What color is the car?"
"How many windows are visible?"
"Where is the person standing?"
"What type of building is this?"Avoid vague prompts, questions requiring external knowledge, and multiple questions in one call:
```
# Too vague
"Tell me about this"

# Requires external knowledge (the model cannot know this)
"Who is the person in this image?"

# Multiple questions: split into separate calls
"What color is the car and how many doors does it have?"
```

Phrase grounding
Phrase grounding detects objects in an image and returns their locations as bounding boxes. The prompt is optional; omitting it uses the model's default detection behavior.
Input: image (prompt optional)
Output: caption + list of objects with bounding boxes (result.result.sentence, result.result.groundings)
Basic phrase grounding example
```python
from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

# With custom prompt
result, error = model(
    source="image.jpg",
    user_prompt="Identify and locate all objects"
)

# Without prompt (default detection)
result, error = model(source="image.jpg")

if error is None:
    print(f"Caption: {result.result.sentence}")
    for grounding in result.result.groundings:
        print(f"{grounding.phrase}: {grounding.grounding}")
```

Phrase grounding use cases
result, error = model(source="scene.jpg")
if error is None:
print(f"Found {len(result.result.groundings)} objects:")
for grounding in result.result.groundings:
print(f" {grounding.phrase} at {grounding.grounding}")result, error = model(
source="worksite.jpg",
user_prompt="Identify and locate all safety equipment and protective gear"
)result, error = model(
source="product.jpg",
user_prompt="Locate any defects, scratches, or imperfections"
)result, error = model(
source="image.jpg",
user_prompt="Locate all people and vehicles"
)Bounding box coordinates
Bounding boxes use normalized coordinates in the range [0, 1024]:
- Format: [x_min, y_min, x_max, y_max]
- Top-left corner: (0, 0)
- Bottom-right corner: (1024, 1024)
Convert to pixel coordinates for visualization:
```python
from PIL import Image

def bbox_to_pixels(bbox, image_path):
    """Convert normalized bbox [0-1024] to pixel coordinates."""
    image = Image.open(image_path)
    width, height = image.size
    x_min, y_min, x_max, y_max = bbox
    return [
        int(x_min / 1024 * width),
        int(y_min / 1024 * height),
        int(x_max / 1024 * width),
        int(y_max / 1024 * height)
    ]

result, error = model(source="image.jpg")
if error is None:
    for grounding in result.result.groundings:
        for bbox in grounding.grounding:
            pixel_bbox = bbox_to_pixels(bbox, "image.jpg")
            print(f"{grounding.phrase}: {pixel_bbox}")
```

The built-in visualize_prediction() utility handles coordinate conversion for you:
```python
from vi.inference.utils.visualize import visualize_prediction

result, error = model(source="image.jpg")
if error is None:
    image = visualize_prediction(image_path="image.jpg", prediction=result)
    image.save("output.jpg")
```

Learn more about result handling →
Phrase grounding prompt guidelines
Good prompts specify object categories or detection targets:
"Locate all people and vehicles"
"Find all safety equipment"
"Detect defects and damage"Avoid questions and counting requests (use VQA for those):
```
# Wrong task type for these (use VQA instead)
"How many cars are there?"
"What color is the car?"
```

Freeform text
Freeform text generates open-ended responses from images. Use it for descriptions, reports, structured data extraction (JSON, YAML), or any custom output format. This is the default task type for pretrained HuggingFace models and for models trained with the freeform text dataset type.
Input: image or video + prompt (video is supported for the same model families as the SDK's Qwen-VL predictor; NVILA and DeepSeek OCR are image-only)
Output: generated text (result.result.caption)
Basic freeform example
```python
from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

result, error = model(
    source="image.jpg",
    user_prompt="Describe this image in detail."
)

if error is None:
    print(f"Response: {result.result.caption}")
```

Video inputs (freeform-trained models)
Point source at a video file, a video URL, or a data:video/... URI. The SDK infers video-freeform when the model task is freeform or generic. Optional fps on model(...) controls frame sampling (default 4.0); it is not part of generation_config.
```python
result, error = model(
    source="demo.mp4",
    user_prompt="Summarize the main events in order.",
    fps=2.0,
    stream=False,
)

if error is None:
    print(result.result.caption)
```

Full video behavior, batching, and model coverage →
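Since data:video/... URIs are one of the accepted source forms, here is a minimal sketch of building one from a local file. `video_to_data_uri` is a hypothetical helper, not part of the SDK; adjust the MIME type for containers other than MP4.

```python
import base64

def video_to_data_uri(path, mime="video/mp4"):
    """Read a video file and encode it as a data: URI string."""
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{payload}"

# With the real model, something like:
# uri = video_to_data_uri("demo.mp4")
# result, error = model(source=uri, user_prompt="Summarize the clip.", fps=2.0)
```

Note that base64 inflates the payload by roughly a third, so for large videos a file path or URL is usually the better choice.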
Freeform use cases
```python
# Structured data extraction
result, error = model(
    source="invoice.jpg",
    user_prompt="Extract the vendor name, invoice number, date, and total amount as JSON."
)
if error is None:
    print(result.result.caption)
    # e.g. {"vendor": "Acme Corp", "invoice_number": "INV-001", "date": "2026-03-15", "total": "$1,234.56"}

# Report generation
result, error = model(
    source="xray.jpg",
    user_prompt="Generate a radiology report describing all findings."
)
if error is None:
    print(result.result.caption)

# Inspection reports
result, error = model(
    source="product.jpg",
    user_prompt="Produce an inspection report noting any surface defects, their locations, and severity."
)
if error is None:
    print(result.result.caption)
```

Freeform prompt guidelines
Good prompts specify the desired output format and scope:
"Describe the contents of this image in detail."
"Generate a JSON report with keys: condition, defects, recommendation."
"Write a caption for this product photo suitable for an e-commerce listing."Avoid prompts that fit VQA or phrase grounding better:
```
# Use VQA for direct questions
"What color is the car?"

# Use phrase grounding for localization
"Find all defects and draw boxes around them."
```

Comparing task types

- VQA: image + question → natural language answer (result.result.answer)
- Phrase grounding: image, prompt optional → caption + bounding boxes (result.result.sentence, result.result.groundings)
- Freeform text: image or video + prompt → generated text (result.result.caption)
Checking the response type
Always use isinstance() to check which response type you received before accessing type-specific fields:
```python
from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse
from vi.inference.task_types.freeform import FreeformResponse

result, error = model(source="image.jpg")

if error is None:
    if isinstance(result, VQAResponse):
        print(f"Answer: {result.result.answer}")
    elif isinstance(result, PhraseGroundingResponse):
        print(f"Caption: {result.result.sentence}")
        print(f"Objects: {len(result.result.groundings)}")
        for grounding in result.result.groundings:
            print(f"  {grounding.phrase}")
    elif isinstance(result, FreeformResponse):
        print(f"Response: {result.result.caption}")
```

See complete response schemas →
Combining both task types
Use phrase grounding for spatial detection first, then VQA for follow-up questions:
```python
# First: locate defects
grounding_result, error = model(
    source="image.jpg",
    user_prompt="Locate all defects"
)
if error is None:
    print(f"Found {len(grounding_result.result.groundings)} defects")

# Then: classify them
vqa_result, error = model(
    source="image.jpg",
    user_prompt="What type of defects are present?"
)
if error is None:
    print(f"Analysis: {vqa_result.result.answer}")
```

You can also ask a sequence of VQA questions about the same image:

```python
questions = [
    "What is the main subject?",
    "What is the background setting?",
    "Are there any people visible?"
]

for question in questions:
    result, error = model(source="image.jpg", user_prompt=question)
    if error is None:
        print(f"Q: {question}")
        print(f"A: {result.result.answer}\n")
```