Annotate Images

Label your images with phrase grounding, VQA, or freeform text annotations to train a vision-language model.

Datature Vi supports three annotation types for images, each suited to a different vision-language model task. Choose based on what you need your model to do.

Before You Start

You need a dataset with uploaded images. Create a dataset if you don't have one yet.


Annotation types

Phrase grounding

Phrase grounding connects natural-language descriptions to specific regions in an image. You write a caption, draw bounding boxes around objects, then link each phrase to its box.

The result: a model that can answer "Find the large black chip" by returning bounding box coordinates, not just a class label.

Typical use cases:

  • Object detection with flexible, natural-language descriptions
  • Visual search and retrieval systems
  • Zero-shot object detection (no fixed category list required)
  • Image description and captioning tasks

Annotate images for phrase grounding
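As an illustration, a phrase-grounding annotation pairs a caption with linked regions. The record below is a hypothetical sketch of that structure (field names like `caption`, `regions`, and `box` are illustrative, not Datature's actual schema):

```python
# Illustrative phrase-grounding record: a caption plus regions
# that link each phrase to a bounding box (hypothetical schema).
annotation = {
    "caption": "A large black chip next to a small resistor",
    "regions": [
        {
            "phrase": "large black chip",
            # Box as [x_min, y_min, x_max, y_max] in pixels.
            "box": [120, 80, 340, 260],
        },
        {
            "phrase": "small resistor",
            "box": [360, 150, 420, 190],
        },
    ],
}

# Every grounded phrase should appear verbatim in the caption,
# so the model can learn the phrase-to-region link.
for region in annotation["regions"]:
    assert region["phrase"] in annotation["caption"]
```

The key property is the link itself: each phrase in the caption maps to exactly one drawn box, which is what lets the trained model return coordinates for a free-text query.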

Visual question answering

Visual question answering teaches your model to answer questions about images. You write question-answer pairs that cover the decisions your model needs to make.

The result: a model that can answer "Is this product defective?" or "How many safety violations are present?" from a camera feed.

Typical use cases:

  • Quality control and inspection systems
  • Inventory counting and shelf monitoring
  • Compliance and safety verification
  • Condition assessment for agriculture or maintenance

Annotate images for VQA
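A VQA annotation is a set of question-answer pairs per image. The sketch below shows what such pairs might look like for an inspection task (an illustrative format, not Datature's internal schema):

```python
# Hypothetical question-answer pairs for one image in a
# quality-control dataset (illustrative format only).
qa_pairs = [
    {"question": "Is this product defective?", "answer": "no"},
    {"question": "How many safety violations are present?", "answer": "2"},
    {"question": "What is the condition of the packaging?", "answer": "intact"},
]

# Short, consistent answers ("no", "2", "intact") give the model
# a stable output vocabulary across the dataset.
for pair in qa_pairs:
    assert pair["question"].endswith("?")
```

Writing the same question with consistently phrased answers across many images is what teaches the model a reliable decision, rather than a one-off response.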

Freeform text

Freeform text gives you an unconstrained text editor for each image. You can write any structured or unstructured text, including JSON, plain descriptions, or custom schemas that match your training pipeline.

The result: a model trained on flexible text outputs tailored to your specific task.

Typical use cases:

  • Custom JSON output schemas for specialized tasks
  • Detailed image descriptions and scene analysis
  • Multi-attribute annotations that don't fit predefined formats
  • Research and experimental annotation workflows

Annotate images with freeform text
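Because the annotation is just text, you can store any schema your pipeline expects. Below is a hypothetical custom JSON schema for a shelf-monitoring task (all field names are invented for illustration):

```python
import json

# Hypothetical custom schema for shelf monitoring -- freeform text
# lets you store whatever structure your training pipeline expects.
record = {
    "scene": "retail shelf, aisle 4",
    "products": [
        {"name": "cereal box", "count": 6, "facing": "front"},
        {"name": "juice carton", "count": 3, "facing": "side"},
    ],
    "out_of_stock": ["granola bars"],
}

# The annotation itself is plain text, so serialize the structure
# before saving it as the image's freeform annotation.
text_annotation = json.dumps(record, indent=2)
```

If you train on JSON like this, keep the schema identical across every image so the model learns to emit one consistent structure.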


Which type should you use?

Choose based on what you want your model to output:

If you need your model to...              Use
Locate objects by description             Phrase grounding
Return bounding box coordinates           Phrase grounding
Answer yes/no or count questions          VQA
Classify or assess conditions             VQA
Produce custom structured output          Freeform text
Output JSON or domain-specific formats    Freeform text
Both locate and answer questions          Use both in separate datasets

You can run different annotation types in separate datasets within the same project.


Annotation workflow

The general flow for any annotation type:

  1. Upload assets to your dataset
  2. Open the annotator from your dataset's Annotate tab
  3. Create annotations (manually, with AI assistance, or both)
  4. Review coverage using the dataset overview
  5. Train your model using the annotated dataset

Next steps