Annotate Images

Label your images with phrase grounding, VQA, or freeform text annotations to train a vision-language model.

Datature Vi supports three annotation types for images, each suited to a different vision-language model task. Choose based on what you need your model to do.

Before You Start

You need a dataset with uploaded images. Create a dataset if you don't have one yet.


Annotation types

Phrase grounding

Phrase grounding connects natural-language descriptions to specific regions in an image. You write a caption, draw bounding boxes around objects, then link each phrase to its box.

The result: a model that can answer "Find the large black chip" by returning bounding box coordinates, not just a class label.

Typical use cases:

  • Object detection with flexible, natural-language descriptions
  • Visual search and retrieval systems
  • Zero-shot object detection (no fixed category list required)
  • Image description and captioning tasks

Annotate images for phrase grounding
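As an illustration, a phrase-grounding annotation pairs a caption with linked regions. The record below is a hypothetical sketch of that structure (field names like `caption`, `regions`, and `box` are illustrative, not Datature's actual schema):

```python
# Illustrative phrase-grounding record: a caption plus regions
# that link each phrase to a bounding box (hypothetical schema).
annotation = {
    "caption": "A large black chip next to a small resistor",
    "regions": [
        {
            "phrase": "large black chip",
            # Box as [x_min, y_min, x_max, y_max] in pixels.
            "box": [120, 80, 340, 260],
        },
        {
            "phrase": "small resistor",
            "box": [360, 150, 420, 190],
        },
    ],
}

# Every grounded phrase should appear verbatim in the caption,
# so the model can learn the phrase-to-region link.
for region in annotation["regions"]:
    assert region["phrase"] in annotation["caption"]
```

The key property is the link itself: each phrase in the caption maps to exactly one drawn box, which is what lets the trained model return coordinates for a free-text query.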

Visual question answering

Visual question answering teaches your model to answer questions about images. You write question-answer pairs that cover the decisions your model needs to make.

The result: a model that can answer "Is this product defective?" or "How many safety violations are present?" from a camera feed.

Typical use cases:

  • Quality control and inspection systems
  • Inventory counting and shelf monitoring
  • Compliance and safety verification
  • Condition assessment for agriculture or maintenance

Annotate images for VQA
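A VQA annotation is a set of question-answer pairs per image. The sketch below shows what such pairs might look like for an inspection task (an illustrative format, not Datature's internal schema):

```python
# Hypothetical question-answer pairs for one image in a
# quality-control dataset (illustrative format only).
qa_pairs = [
    {"question": "Is this product defective?", "answer": "no"},
    {"question": "How many safety violations are present?", "answer": "2"},
    {"question": "What is the condition of the packaging?", "answer": "intact"},
]

# Short, consistent answers ("no", "2", "intact") give the model
# a stable output vocabulary across the dataset.
for pair in qa_pairs:
    assert pair["question"].endswith("?")
```

Writing the same question with consistently phrased answers across many images is what teaches the model a reliable decision, rather than a one-off response.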

Freeform text

Freeform text gives you an unconstrained text editor for each image. You can write any structured or unstructured text, including JSON, plain descriptions, or custom schemas that match your training pipeline.

The result: a model trained on flexible text outputs tailored to your specific task.

Typical use cases:

  • Custom JSON output schemas for specialized tasks
  • Detailed image descriptions and scene analysis
  • Multi-attribute annotations that don't fit predefined formats
  • Research and experimental annotation workflows

Annotate images with freeform text
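Because the annotation is just text, you can store any schema your pipeline expects. Below is a hypothetical custom JSON schema for a shelf-monitoring task (all field names are invented for illustration):

```python
import json

# Hypothetical custom schema for shelf monitoring -- freeform text
# lets you store whatever structure your training pipeline expects.
record = {
    "scene": "retail shelf, aisle 4",
    "products": [
        {"name": "cereal box", "count": 6, "facing": "front"},
        {"name": "juice carton", "count": 3, "facing": "side"},
    ],
    "out_of_stock": ["granola bars"],
}

# The annotation itself is plain text, so serialize the structure
# before saving it as the image's freeform annotation.
text_annotation = json.dumps(record, indent=2)
```

If you train on JSON like this, keep the schema identical across every image so the model learns to emit one consistent structure.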


Which type should you use?

Choose based on what you want your model to output:

If you need your model to...              Use
Locate objects by description             Phrase grounding
Return bounding box coordinates           Phrase grounding
Answer yes/no or count questions          VQA
Classify or assess conditions             VQA
Produce custom structured output          Freeform text
Output JSON or domain-specific formats    Freeform text
Both locate and answer questions          Use both in separate datasets

You can run different annotation types in separate datasets within the same project.


Annotation workflow

The general flow for any annotation type:

  1. Upload assets to your dataset
  2. Open the annotator from your dataset's Annotate tab
  3. Create annotations (manually, with AI assistance, or both)
  4. Review coverage using the dataset overview
  5. Train your model using the annotated dataset

Next steps