Task Types

Datature Vi supports three inference task types: visual question answering (VQA), phrase grounding, and freeform text. The model determines which applies based on your prompt and training configuration; you don't set it explicitly.

Each task type returns a different response structure. See prediction schemas for the complete field reference.

Visual question answering (VQA)

VQA answers natural language questions about an image. The model analyzes the image and generates a contextual text response to your question.

Input: image + question
Output: natural language answer (result.result.answer)

Basic example

from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

result, error = model(
    source="image.jpg",
    user_prompt="What color is the car in this image?"
)

if error is None:
    print(f"Answer: {result.result.answer}")

VQA use cases

# Counting
result, error = model(
    source="crowd.jpg",
    user_prompt="How many people are visible in this image?"
)

# Text and brand recognition
result, error = model(
    source="product.jpg",
    user_prompt="What is the brand name on this product?"
)

# Quality inspection
result, error = model(
    source="manufactured_part.jpg",
    user_prompt="Are there any visible defects or damage?"
)

# Scene understanding
result, error = model(
    source="scene.jpg",
    user_prompt="What is the main activity happening in this scene?"
)

# Spatial reasoning
result, error = model(
    source="room.jpg",
    user_prompt="What is the position of the table relative to the window?"
)

VQA prompt guidelines

Good prompts are specific, focus on observable elements, and use question words (what, where, how many):

"What color is the car?"
"How many windows are visible?"
"Where is the person standing?"
"What type of building is this?"

Avoid vague prompts, questions requiring external knowledge, and multiple questions in one call:

# Too vague
"Tell me about this"

# Requires external knowledge (model cannot know this)
"Who is the person in this image?"

# Multiple questions: split into separate calls
"What color is the car and how many doors does it have?"

Phrase grounding

Phrase grounding detects objects in an image and returns their locations as bounding boxes. The prompt is optional; omitting it uses the model's default detection behavior.

Input: image (prompt optional)
Output: caption + list of objects with bounding boxes (result.result.sentence, result.result.groundings)

Basic phrase grounding example

from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

# With custom prompt
result, error = model(
    source="image.jpg",
    user_prompt="Identify and locate all objects"
)

# Without prompt (default detection)
result, error = model(source="image.jpg")

if error is None:
    print(f"Caption: {result.result.sentence}")
    for grounding in result.result.groundings:
        print(f"{grounding.phrase}: {grounding.grounding}")

Phrase grounding use cases

# General object detection (no prompt)
result, error = model(source="scene.jpg")

if error is None:
    print(f"Found {len(result.result.groundings)} objects:")
    for grounding in result.result.groundings:
        print(f"  {grounding.phrase} at {grounding.grounding}")

# Safety equipment detection
result, error = model(
    source="worksite.jpg",
    user_prompt="Identify and locate all safety equipment and protective gear"
)

# Defect localization
result, error = model(
    source="product.jpg",
    user_prompt="Locate any defects, scratches, or imperfections"
)

# Targeted object categories
result, error = model(
    source="image.jpg",
    user_prompt="Locate all people and vehicles"
)

Bounding box coordinates

Bounding boxes use normalized coordinates in the range [0, 1024]:

  • Format: [x_min, y_min, x_max, y_max]
  • Top-left corner: (0, 0)
  • Bottom-right corner: (1024, 1024)

Convert to pixel coordinates for visualization:

from PIL import Image

def bbox_to_pixels(bbox, image_path):
    """Convert normalized bbox [0-1024] to pixel coordinates."""
    image = Image.open(image_path)
    width, height = image.size
    x_min, y_min, x_max, y_max = bbox

    return [
        int(x_min / 1024 * width),
        int(y_min / 1024 * height),
        int(x_max / 1024 * width),
        int(y_max / 1024 * height)
    ]

result, error = model(source="image.jpg")
if error is None:
    for grounding in result.result.groundings:
        for bbox in grounding.grounding:
            pixel_bbox = bbox_to_pixels(bbox, "image.jpg")
            print(f"{grounding.phrase}: {pixel_bbox}")
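Because coordinates are already normalized to [0, 1024], simple geometry such as box area can be computed before any pixel conversion. A small SDK-independent sketch (the helper name is ours, not part of the library):

```python
def bbox_area_fraction(bbox):
    """Fraction of the image covered by a normalized [x_min, y_min, x_max, y_max] box."""
    x_min, y_min, x_max, y_max = bbox
    return ((x_max - x_min) * (y_max - y_min)) / (1024 * 1024)

# A box spanning the left half of the image covers half of it:
print(bbox_area_fraction([0, 0, 512, 1024]))  # 0.5
```

This works identically for every image size, which makes it handy for filtering out tiny detections before converting to pixels.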

The built-in visualize_prediction() utility handles coordinate conversion for you:

from vi.inference.utils.visualize import visualize_prediction

result, error = model(source="image.jpg")
if error is None:
    image = visualize_prediction(image_path="image.jpg", prediction=result)
    image.save("output.jpg")

Learn more about result handling →

Phrase grounding prompt guidelines

Good prompts specify object categories or detection targets:

"Locate all people and vehicles"
"Find all safety equipment"
"Detect defects and damage"

Avoid questions and counting requests (use VQA for those):

# Wrong task type for these (use VQA instead)
"How many cars are there?"
"What color is the car?"

Freeform text

Freeform text generates open-ended responses from images. Use it for descriptions, reports, structured data extraction (JSON, YAML), or any custom output format. This is the default task type for pretrained HuggingFace models and for models trained with the freeform text dataset type.

Input: image or video + prompt (video is supported by the same model families as the SDK's Qwen-VL predictor; NVILA and DeepSeek OCR are image-only)

Output: generated text (result.result.caption)

Basic freeform example

from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

result, error = model(
    source="image.jpg",
    user_prompt="Describe this image in detail."
)

if error is None:
    print(f"Response: {result.result.caption}")

Video inputs (freeform-trained models)

Point source at a video file, a video URL, or a data:video/... URI. The SDK infers video-freeform when the model task is freeform or generic. Optional fps on model(...) controls frame sampling (default 4.0); it is not part of generation_config.

result, error = model(
    source="demo.mp4",
    user_prompt="Summarize the main events in order.",
    fps=2.0,
    stream=False,
)

if error is None:
    print(result.result.caption)
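If fps is a uniform sampling rate, the number of frames the model sees scales with clip length. A rough estimate, assuming uniform sampling (our reading of the parameter, not documented SDK behavior):

```python
def estimated_sampled_frames(duration_seconds, fps=4.0):
    """Approximate frame count fed to the model for a clip sampled at `fps`."""
    return max(1, int(duration_seconds * fps))

# A 30 s clip at fps=2.0 contributes roughly 60 frames;
# at the default 4.0 it would be roughly 120.
print(estimated_sampled_frames(30, 2.0))
```

Lowering fps is the simplest lever for keeping long videos within the model's context budget.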

Full video behavior, batching, and model coverage →

Freeform use cases

# Structured data extraction
result, error = model(
    source="invoice.jpg",
    user_prompt="Extract the vendor name, invoice number, date, and total amount as JSON."
)

if error is None:
    print(result.result.caption)
    # e.g. {"vendor": "Acme Corp", "invoice_number": "INV-001", "date": "2026-03-15", "total": "$1,234.56"}

# Report generation
result, error = model(
    source="xray.jpg",
    user_prompt="Generate a radiology report describing all findings."
)

if error is None:
    print(result.result.caption)

# Inspection summaries
result, error = model(
    source="product.jpg",
    user_prompt="Produce an inspection report noting any surface defects, their locations, and severity."
)

if error is None:
    print(result.result.caption)
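When you request JSON, the caption is still plain text and models sometimes wrap it in Markdown fences, so a defensive parse is worth the few extra lines. A minimal sketch (the helper and its (data, error) return shape are ours, chosen to mirror the SDK's (result, error) convention):

```python
import json

def parse_json_caption(caption):
    """Parse a freeform caption that is expected to contain JSON.

    Strips Markdown code fences if present. Returns (data, None) on
    success or (None, error) when the text is not valid JSON.
    """
    text = caption.strip()
    if text.startswith("```"):
        # Drop the opening fence (with optional language tag) and the closing fence.
        lines = [line for line in text.splitlines() if not line.startswith("```")]
        text = "\n".join(lines)
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, exc
```

Feed it result.result.caption and fall back to the raw text (or retry with a stricter prompt) when the error is not None.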

Freeform prompt guidelines

Good prompts specify the desired output format and scope:

"Describe the contents of this image in detail."
"Generate a JSON report with keys: condition, defects, recommendation."
"Write a caption for this product photo suitable for an e-commerce listing."

Avoid prompts that fit VQA or phrase grounding better:

# Use VQA for direct questions
"What color is the car?"

# Use phrase grounding for localization
"Find all defects and draw boxes around them."

Comparing task types

|                | VQA                          | Phrase grounding               | Freeform text                          |
| -------------- | ---------------------------- | ------------------------------ | -------------------------------------- |
| User prompt    | Required                     | Optional                       | Required                               |
| Output         | Natural language answer      | Caption + bounding boxes       | Open-ended text or structured data     |
| Primary use    | Understanding, Q&A, counting | Detection, localization        | Descriptions, reports, JSON extraction |
| Example prompt | "What color is the car?"     | "Locate all vehicles"          | "Describe this image in detail."       |
| Answer field   | result.result.answer         | result.result.sentence         | result.result.caption                  |
| Bounding boxes | No                           | Yes (result.result.groundings) | No                                     |
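The three answer fields above can be collapsed into one helper that tries each in turn. A duck-typed sketch (the function is ours, not an SDK utility; it deliberately avoids the isinstance checks shown below so it also works on test stubs):

```python
from types import SimpleNamespace

def extract_text(result):
    """Return the primary text field of a prediction, regardless of task type.

    Tries the answer fields in order: VQA (.answer), phrase
    grounding (.sentence), freeform (.caption).
    """
    inner = result.result
    for field in ("answer", "sentence", "caption"):
        value = getattr(inner, field, None)
        if value is not None:
            return value
    return str(inner)  # fall back to the raw payload

# Works on any object exposing the same attributes, e.g. a stub:
stub = SimpleNamespace(result=SimpleNamespace(answer="red"))
print(extract_text(stub))  # red
```

Use the isinstance pattern below instead when you need type-specific fields such as groundings.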

Checking the response type

Always use isinstance() to check which response type you received before accessing type-specific fields:

from vi.inference.task_types.vqa import VQAResponse
from vi.inference.task_types.phrase_grounding import PhraseGroundingResponse
from vi.inference.task_types.freeform import FreeformResponse

result, error = model(source="image.jpg")

if error is None:
    if isinstance(result, VQAResponse):
        print(f"Answer: {result.result.answer}")

    elif isinstance(result, PhraseGroundingResponse):
        print(f"Caption: {result.result.sentence}")
        print(f"Objects: {len(result.result.groundings)}")
        for grounding in result.result.groundings:
            print(f"  {grounding.phrase}")

    elif isinstance(result, FreeformResponse):
        print(f"Response: {result.result.caption}")

See complete response schemas →

Combining task types

Use phrase grounding for spatial detection first, then VQA for follow-up questions:

# First: locate defects
grounding_result, error = model(
    source="image.jpg",
    user_prompt="Locate all defects"
)

if error is None:
    print(f"Found {len(grounding_result.result.groundings)} defects")

    # Then: classify them
    vqa_result, error = model(
        source="image.jpg",
        user_prompt="What type of defects are present?"
    )

    if error is None:
        print(f"Analysis: {vqa_result.result.answer}")
Run several VQA questions against the same image in separate calls:

questions = [
    "What is the main subject?",
    "What is the background setting?",
    "Are there any people visible?"
]

for question in questions:
    result, error = model(source="image.jpg", user_prompt=question)
    if error is None:
        print(f"Q: {question}")
        print(f"A: {result.result.answer}\n")

FAQ

Do I need to set the task type explicitly?

No. The model determines the task type based on your prompt and how it was trained. You don't pass a task type parameter. Write prompts appropriate for the behavior you want.

What happens if the output doesn't match any task type?

If the model output does not match VQAResponse, PhraseGroundingResponse, or FreeformResponse, the SDK falls back to a GenericResponse, which contains the raw text output in result.result. This happens when the model output cannot be parsed into a structured format. Check result.raw_output for debugging. See prediction schemas for details.

Can I run phrase grounding without a prompt?

Yes. Omitting user_prompt for a phrase grounding model triggers default detection behavior: the model decides what to detect. To focus detection on specific objects or categories, provide a prompt.

Related resources

Prediction Schemas

Complete field reference for VQAResponse, PhraseGroundingResponse, FreeformResponse, and GenericResponse.

Handle Results

Access captions and bounding boxes, convert coordinates, visualize predictions.

Run Inference

Single-image, batch, streaming, and folder processing.