How Do I Choose a Dataset Type?

Pick the right dataset type for your task in Datature Vi. Compare phrase grounding, VQA, and freeform text by output format, annotation effort, and use case.

Your dataset type determines what your model outputs and how you annotate your data. Pick the wrong type and you will spend time annotating in a format that does not match your goal. This page helps you decide between the three types available in Datature Vi: phrase grounding, visual question answering, and freeform text.


Start with your output

The fastest way to choose a dataset type is to ask: what does my application need the model to return?

1. Bounding boxes with labels

You need the model to find objects in an image and draw boxes around them. Choose phrase grounding.

Examples: locate defects on a product, find specific items on a shelf, highlight regions matching a description.

2. Text answers to questions

You need the model to answer questions about an image in natural language. Choose visual question answering (VQA).

Examples: "Is there a crack in this tile?", "How many pallets are in this image?", "What color is the warning label?"

3. Custom or structured output

You need the model to return output in a specific format you define: JSON reports, checklists, multi-field descriptions, or any schema that does not fit the two types above. Choose freeform text.

Examples: generate a JSON inspection report, produce a medical image summary with fixed fields, output a structured product description.

Not sure yet?

If your task could fit more than one type, start with VQA. It needs the least annotation effort of the three and covers a wide range of questions. You can switch later by creating a new dataset; your images stay in your workspace.
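The three choices above, plus the VQA fallback, can be sketched as a small decision helper. This is an illustration only: `DatasetType` and `choose_dataset_type` are hypothetical names, not part of Datature Vi, and the order of the checks (boxes first, then custom format) is an assumption for tasks that could fit more than one type.

```python
from enum import Enum


class DatasetType(Enum):
    PHRASE_GROUNDING = "phrase grounding"
    VQA = "visual question answering"
    FREEFORM_TEXT = "freeform text"


def choose_dataset_type(needs_boxes: bool = False,
                        needs_custom_format: bool = False) -> DatasetType:
    """Map the output an application needs to a dataset type.

    Hypothetical helper for illustration; not a Datature Vi API.
    """
    if needs_boxes:
        # The model must localize objects with bounding boxes.
        return DatasetType.PHRASE_GROUNDING
    if needs_custom_format:
        # The output must follow a schema you define (JSON, checklists).
        return DatasetType.FREEFORM_TEXT
    # Natural-language answers, and the fallback when you are not sure yet.
    return DatasetType.VQA


print(choose_dataset_type(needs_boxes=True).value)
print(choose_dataset_type(needs_custom_format=True).value)
print(choose_dataset_type().value)
```

Calling the helper with no arguments returns VQA, which mirrors the "not sure yet" advice.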


Side-by-side comparison

| Feature | Phrase Grounding | VQA | Freeform Text |
| --- | --- | --- | --- |
| Model output | Bounding boxes + text labels | Natural language text | Any text format you define |
| Annotation format | Image-phrase-box triplets | Question-answer pairs | Custom schema (you define) |
| Evaluation metrics | IoU, F1, Precision, Recall | BLEU, BERTScore | BLEU, BERTScore |
| Annotation effort | Medium: draw boxes and write descriptions | Low: write question-answer pairs | Varies: depends on your schema complexity |
| Best for | Object localization, spatial tasks | Image understanding, yes/no checks | Structured reports, custom formats, research |
| Supports CoT | No | Yes | Yes |
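For readers new to the phrase grounding metrics in the comparison, here is a minimal sketch of how IoU, precision, recall, and F1 are commonly computed for bounding boxes. It uses the standard definitions and makes several assumptions: boxes are `(x_min, y_min, x_max, y_max)` tuples, a prediction counts as correct when its IoU with an unmatched ground-truth box is at least 0.5, and matching is greedy. How Datature Vi computes these metrics internally is not stated here and may differ.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(predictions, ground_truths, threshold=0.5):
    """Greedy one-to-one matching at an assumed IoU threshold."""
    matched = set()
    true_positives = 0
    for pred in predictions:
        best_iou, best_idx = 0.0, None
        for idx, truth in enumerate(ground_truths):
            if idx in matched:
                continue  # each ground-truth box can be matched only once
            score = iou(pred, truth)
            if score > best_iou:
                best_iou, best_idx = score, idx
        if best_idx is not None and best_iou >= threshold:
            matched.add(best_idx)
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truths) if ground_truths else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


print(round(iou((10, 10, 50, 50), (30, 30, 70, 70)), 3))  # 0.143

preds = [(10, 10, 50, 50), (100, 100, 140, 140)]
truths = [(12, 12, 52, 52)]
print(precision_recall_f1(preds, truths))
```

In the second example, one of the two predictions matches the single ground-truth box, so precision is 0.5 and recall is 1.0.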

Match your industry

Different industries tend to favor specific dataset types. Use these patterns as a starting point, not a rule.

Manufacturing and quality inspection

Primary type: Phrase grounding

You train the model to locate defects, misaligned parts, or missing components by describing them in natural language. The model returns bounding boxes around each match.

When to use VQA instead: If you need binary pass/fail answers ("Does this PCB have solder bridging?") rather than localized defect boxes.

When to use freeform text: If you need a structured inspection report with multiple fields (defect type, severity, location, recommended action) returned as JSON.
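As an illustration of the freeform text option, the sketch below builds one structured inspection report and serializes it to JSON. The field names mirror the four fields mentioned above, but the exact schema, the `InspectionReport` class, and the sample values are all hypothetical; with freeform text, you define the schema that fits your process.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class InspectionReport:
    # Hypothetical schema based on the fields named in the text.
    defect_type: str
    severity: str
    location: str
    recommended_action: str


# Made-up sample values for a single image.
report = InspectionReport(
    defect_type="solder bridging",
    severity="high",
    location="between two adjacent pins",
    recommended_action="rework",
)

# The JSON string is what the annotation for this image could look like.
target_text = json.dumps(asdict(report), indent=2)
print(target_text)
```

Keeping the schema in one typed definition makes it easier to keep annotations consistent across annotators.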

Document processing

Primary type: Freeform text (with structured data extraction)

You define a schema that matches your document fields (invoice number, line items, totals) and train the model to extract them into JSON or YAML. See Structured Data Extraction for details.

When to use VQA instead: If you only need to answer specific questions about documents ("What is the total amount?" or "Who is the sender?") without extracting the full document structure.
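When a model returns extracted fields as JSON, a downstream check can catch malformed output before it reaches other systems. The sketch below is one possible check, not a Datature Vi feature: the field names (`invoice_number`, `line_items`, `total`, and `amount` inside each line item) are assumed for illustration, and `validate_invoice` is a hypothetical helper.

```python
import json

# Assumed field names for illustration; use whatever your schema defines.
REQUIRED_FIELDS = {"invoice_number", "line_items", "total"}


def validate_invoice(raw_output: str) -> list[str]:
    """Return a list of problems found in a model's JSON output (empty if none)."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = [f"missing field: {name}"
                for name in sorted(REQUIRED_FIELDS - set(data))]
    if not problems:
        # Cross-check: the line item amounts should add up to the total.
        line_sum = sum(item.get("amount", 0) for item in data["line_items"])
        if abs(line_sum - data["total"]) > 0.005:
            problems.append(
                f"line items sum to {line_sum}, but total is {data['total']}")
    return problems


good = json.dumps({
    "invoice_number": "INV-001",
    "line_items": [
        {"description": "Widget", "amount": 40.0},
        {"description": "Shipping", "amount": 10.0},
    ],
    "total": 50.0,
})
print(validate_invoice(good))                             # []
print(validate_invoice('{"invoice_number": "INV-001"}'))
```

The second call reports the two missing fields instead of raising an error, which makes the check easy to run in a batch over many documents.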

Medical imaging

Primary type: VQA with chain-of-thought reasoning

Medical imaging benefits from explainable answers. Pair VQA with chain-of-thought annotations so the model explains its reasoning: "I see a 3mm opacity in the lower left lobe, consistent with..." This makes outputs auditable.

When to use phrase grounding instead: If you need the model to mark the location of findings (tumors, fractures, anomalies) with bounding boxes.

Retail and warehousing

Primary type: Phrase grounding

Locate products on shelves, identify out-of-stock positions, or find specific items in warehouse images.

When to use VQA instead: If you need to count items ("How many units of SKU-1234 are on this shelf?") or check compliance ("Is the promotional display set up correctly?").

Agriculture and aerial imagery

Primary type: VQA

Analyze aerial or drone images with questions about crop health, irrigation status, or land use. VQA handles the variety of questions you can ask about satellite and drone imagery.

When to use phrase grounding instead: If you need to mark specific regions (diseased patches, flooded areas) with bounding boxes for downstream GIS processing.


Frequently asked questions

Can I change the dataset type after I create a dataset?

No. The dataset type is set when you create the dataset and cannot be changed. However, you can create a new dataset with a different type and upload the same images. Your images remain in your workspace regardless of which datasets reference them.

Can I use more than one dataset type?

Yes. You can create separate datasets with different types and train a separate model on each. Each model specializes in its task. At inference time, you choose which model to call based on the output you need.
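Choosing which model to call at inference time can be as simple as a lookup table. The sketch below assumes three separately trained models and uses placeholder functions in their place; `locate_defects`, `answer_question`, `write_report`, and `run` are hypothetical names, and a real setup would replace the placeholders with calls to your own inference interface.

```python
from typing import Callable, Dict


# Placeholders standing in for three separately trained models.
def locate_defects(image_path: str) -> str:
    return f"[boxes for {image_path}]"


def answer_question(image_path: str) -> str:
    return f"[answer for {image_path}]"


def write_report(image_path: str) -> str:
    return f"[report for {image_path}]"


MODELS: Dict[str, Callable[[str], str]] = {
    "boxes": locate_defects,     # model trained on a phrase grounding dataset
    "answer": answer_question,   # model trained on a VQA dataset
    "report": write_report,      # model trained on a freeform text dataset
}


def run(output_kind: str, image_path: str) -> str:
    """Route a request to the model that produces the requested output."""
    try:
        model = MODELS[output_kind]
    except KeyError:
        raise ValueError(f"unknown output kind: {output_kind!r}") from None
    return model(image_path)


print(run("boxes", "tile_001.jpg"))  # [boxes for tile_001.jpg]
```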

Which dataset type takes the least annotation effort?

VQA requires the least effort per annotation: you write one question and one answer. Phrase grounding requires drawing bounding boxes, which takes more time. Freeform text varies with the complexity of your schema.

What if my output does not fit phrase grounding or VQA?

Use freeform text. It lets you define any annotation schema, so if your output format does not match phrase grounding boxes or VQA question-answer pairs, freeform text gives you full flexibility.

When should I use chain-of-thought reasoning?

Chain-of-thought reasoning works with VQA and freeform text datasets. Use it when your task involves multi-step logic, when you need the model to show its work, or when the correct answer depends on analyzing multiple parts of the image. See Chain-of-Thought Reasoning for setup details.

