Annotate Data

Add labels to your images and videos to teach a vision-language model what to recognize and how to respond.

Annotations are the labeled examples your vision-language model (VLM) learns from. Datature Vi supports annotations for both images and videos, with different annotation types suited to different tasks. This page helps you pick the right asset type and annotation format, and points you to the step-by-step guides.

Before You Start

A dataset with uploaded images or uploaded videos. Create a dataset if you don't have one yet.

New to Datature Vi? Learn what it does or follow the quickstart.

By the end of this guide

Add labels to your images and videos using phrase grounding, VQA, and freeform text annotations.


Annotate by asset type

Annotate Images

Label images with phrase grounding, VQA, or freeform text annotations for VLM training.

Annotate Videos

Annotate video frame sequences with freeform text for temporal reasoning tasks.


Image annotation types

Datature Vi supports three annotation types for images:

Phrase Grounding

Link text descriptions to bounding boxes. Teaches your model to locate objects by natural-language description.

Visual Question Answering

Create question-answer pairs about images. Teaches your model to answer specific questions about what it sees.

Freeform Text

Write open-ended text annotations in any structure for flexible model training.
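The three types produce differently shaped training records. As a rough illustration (the field names below are hypothetical sketches, not Datature Vi's actual export schema), the records might look like:

```python
# Hypothetical sketches of what each image annotation type captures.
# Field names are illustrative only, not Datature Vi's export format.

# Phrase grounding: a caption with phrases linked to bounding boxes
# (boxes as [x_min, y_min, x_max, y_max] in pixels).
phrase_grounding = {
    "caption": "a red car parked beside a fire hydrant",
    "groundings": [
        {"phrase": "a red car", "box": [120, 80, 420, 300]},
        {"phrase": "a fire hydrant", "box": [430, 180, 480, 310]},
    ],
}

# VQA: question-answer pairs about the same image.
vqa = {
    "pairs": [
        {"question": "How many cars are visible?", "answer": "1"},
        {"question": "Is the hydrant obstructed?", "answer": "no"},
    ],
}

# Freeform text: open-ended text in whatever structure your task needs,
# e.g. a JSON-style inspection report your downstream code can parse.
freeform = {
    "text": '{"scene": "street", "vehicles": 1, "violations": ["parked too close to hydrant"]}',
}
```

Whatever structure you choose for freeform text, keep it consistent across the dataset so the model learns one output format.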

Video annotation types

Datature Vi supports freeform text annotation for videos. Use the timeline scrubber to select frame sequences and write text annotations that describe actions, events, and temporal relationships.

Freeform Text

Annotate video frame sequences with freeform text for action recognition and temporal reasoning.
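Conceptually, a video freeform annotation pairs a frame range (selected with the timeline scrubber) with text describing what happens in it. A minimal sketch, with hypothetical field names rather than Datature Vi's actual schema:

```python
# Hypothetical sketch of a video freeform annotation: a frame range
# plus text describing the action. Field names are illustrative only.
video_annotation = {
    "start_frame": 120,
    "end_frame": 240,
    "text": "The forklift reverses, turns left, and lowers its load onto the pallet.",
}

# At 30 fps, this range covers a 4-second clip.
duration_s = (video_annotation["end_frame"] - video_annotation["start_frame"]) / 30
```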


Which type should you use?

Choose based on what you want your model to output:

| If you need your model to... | Use |
| --- | --- |
| Locate objects by description | Phrase grounding (images) |
| Return bounding box coordinates | Phrase grounding (images) |
| Answer yes/no or count questions | VQA (images) |
| Classify or assess conditions | VQA (images) |
| Produce custom structured output | Freeform text (images) |
| Describe actions and events in video | Freeform text (videos) |
| Analyze temporal behavior or physics | Freeform text (videos) |


You can use different annotation types in separate datasets within the same project.


Annotation workflow

The general flow for any annotation type:

  1. Upload assets to your dataset
  2. Open the annotator from your dataset's Annotate tab
  3. Create annotations (manually, with AI assistance, or both)
  4. Review coverage using the dataset overview
  5. Train your model using the annotated dataset
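Step 4, reviewing coverage, amounts to checking what fraction of your assets carry at least one annotation before you train. A minimal sketch in plain Python (not the Datature Vi API, which surfaces this in the dataset overview):

```python
# Sketch of a coverage check: which assets still need annotations?
# Plain Python with illustrative data, not the Datature Vi API.
assets = ["img_001.jpg", "img_002.jpg", "img_003.jpg", "img_004.jpg"]
annotations = {"img_001.jpg": 3, "img_003.jpg": 1}  # asset -> annotation count

unannotated = [a for a in assets if annotations.get(a, 0) == 0]
coverage = 1 - len(unannotated) / len(assets)

print(f"coverage: {coverage:.0%}, missing: {unannotated}")
# Here coverage is 50%: img_002 and img_004 still need labels.
```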

AI-assisted annotation

Datature Vi includes IntelliScribe, an AI tool that speeds up phrase grounding and freeform image annotation. For phrase grounding, IntelliScribe can generate captions automatically (press C) and link phrases to bounding boxes (press P). For freeform image annotation, IntelliScribe generates text content (press C) that you can edit to match your schema.

AI assistance is most useful on large datasets with common objects. For domain-specific content, use the AI-generated text as a starting point, then edit it to match your domain vocabulary.

Learn about AI-assisted tools


Next steps

Annotate Images

Choose from phrase grounding, VQA, or freeform text annotations for your images.

Annotate Videos

Annotate video frame sequences with freeform text for temporal reasoning.

AI-Assisted Tools

Learn how IntelliScribe generates captions and links phrases automatically to speed up annotation.