Phrase Grounding

Draw bounding boxes and link them to text descriptions to build phrase grounding training data for your VLM.

Before You Start

Phrase grounding annotations teach your vision-language model (VLM) to locate objects by their text description. Each annotation pairs a written caption with bounding boxes, where specific phrases in the caption link to the boxes they describe. This guide walks through creating those annotations in Datature Vi.
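To make the caption, box, and phrase-link relationship concrete, here is a minimal sketch of how one annotation could be modeled. The class names, the pixel-corner box convention, and the character-offset spans are assumptions for illustration; they are not Datature Vi's storage or export format.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    # Pixel coordinates of the box corners (assumed convention).
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class PhraseLink:
    # Character span [start, end) of the phrase inside the caption.
    start: int
    end: int
    box: Box


@dataclass
class GroundedCaption:
    caption: str
    links: list[PhraseLink] = field(default_factory=list)

    def phrase(self, link: PhraseLink) -> str:
        """Return the caption text covered by a link."""
        return self.caption[link.start:link.end]


def link_phrase(item: GroundedCaption, phrase: str, box: Box) -> PhraseLink:
    """Link the first occurrence of `phrase` in the caption to `box`."""
    start = item.caption.find(phrase)
    if start == -1:
        raise ValueError(f"phrase not found in caption: {phrase!r}")
    link = PhraseLink(start, start + len(phrase), box)
    item.links.append(link)
    return link


# Hypothetical example: one caption, two phrases, two boxes.
example = GroundedCaption("blue safety helmet on the left edge, above the connector")
helmet = link_phrase(example, "blue safety helmet", Box(12, 40, 180, 210))
connector = link_phrase(example, "connector", Box(30, 230, 160, 300))
```

Storing phrases as offsets into the caption, rather than as copied strings, keeps each link tied to one specific place in the text even when a word appears more than once.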


Open the annotator

Go to your dataset, then click the Annotator tab. Click any image thumbnail in the bottom strip to load it onto the canvas.

1. Open the Annotator tab

From the Dataset Overview page, click the Annotator tab to open the labeling interface.

You should see the Dataset Overview showing image and annotation counts, and heatmaps showing annotation patterns.

Your annotations are ready when the annotation count matches the image count and the heatmaps show annotation patterns.


Keyboard shortcuts

Key   Action
R     Rectangle tool: draw bounding boxes
T     Text tool: write or edit the caption
H     Highlight tool: link phrases to boxes
D     Delete tool: remove boxes or phrase links
C     IntelliScribe Caption: AI-generate a caption
P     IntelliScribe Phrases: AI-link phrases to boxes
E     Next image
Q     Previous image
Esc   Exit current tool mode
?     Show all shortcuts

Annotation guidelines

These guidelines produce annotations that train well. Inconsistent annotations train poorly regardless of quantity.

Captions

  • Be specific: "FPGA chip" over "chip", "blue safety helmet" over "helmet"
  • Include spatial info: "in the center", "on the left edge", "above the connector"
  • Use consistent terminology across your entire dataset
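One way to spot inconsistent terminology is to count how often each linked phrase appears across the dataset and review the rare ones. The sketch below assumes you already have the phrases as a plain list of strings; the function name and the threshold are hypothetical.

```python
from collections import Counter


def rare_phrases(phrases, max_count=1):
    """Return phrases (lowercased) that occur at most `max_count` times."""
    counts = Counter(p.strip().lower() for p in phrases)
    return sorted(p for p, n in counts.items() if n <= max_count)


# Hypothetical phrases collected from several captions.
collected = ["FPGA chip", "fpga chip", "FPGA", "blue safety helmet", "Blue safety helmet "]
flagged = rare_phrases(collected)
```

A phrase that shows up only once is not necessarily wrong, but it is a useful starting point for checking whether it is a variant of a term you already use.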

Bounding boxes

  • Include the full object with minimal empty space
  • Stay consistent about box tightness across images
  • For partial objects, box the visible portion and note it in the caption

Phrase links

  • Each highlighted phrase must be a distinct span of the caption that does not overlap any other highlighted phrase
  • One phrase per box; split complex descriptions into separate pairs
  • Aim for 5-10 phrase-box pairs per image (3-5 minimum, up to 15 for complex scenes)
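The phrase-link guidelines above can be checked mechanically. The following is a sketch under stated assumptions: each link is a hypothetical (start, end, box_id) tuple with character offsets into the caption, "one phrase per box" is read as each box id appearing once, and the default thresholds are taken loosely from the 3 to 15 range mentioned above. None of this reflects a Datature Vi API.

```python
from collections import Counter


def check_phrase_links(links, min_pairs=3, max_pairs=15):
    """Return a list of human-readable problems for one image.

    `links` is a list of (start, end, box_id) tuples, where [start, end)
    is the character span of the phrase in the caption (assumed layout).
    """
    problems = []

    if not min_pairs <= len(links) <= max_pairs:
        problems.append(f"{len(links)} pairs; expected {min_pairs}-{max_pairs}")

    # Sort by start offset, then track the furthest end seen so far.
    ordered = sorted(links, key=lambda link: (link[0], link[1]))
    furthest_end = 0
    for i, (start, end, _) in enumerate(ordered):
        if i > 0 and start < furthest_end:
            problems.append(f"span [{start}, {end}) overlaps an earlier phrase")
        furthest_end = max(furthest_end, end)

    # One phrase per box: each box id should appear exactly once.
    counts = Counter(box_id for _, _, box_id in links)
    for box_id, n in counts.items():
        if n > 1:
            problems.append(f"box {box_id} is linked to {n} phrases")

    return problems


# Hypothetical links for one image.
clean = [(0, 18, "a"), (26, 35, "b"), (47, 56, "c")]
messy = [(0, 18, "a"), (5, 10, "b"), (47, 56, "b")]
```

Here `check_phrase_links(clean)` returns an empty list, while `messy` produces one overlap problem and one duplicate-box problem.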

Edit or delete annotations

You can modify or remove annotations at any time.

To edit a caption: Press T, click in the text area, and make your changes. Changes save automatically.

To resize or reposition a box: Click the box, then drag it or drag a handle. The phrase link stays connected.

To delete a box: Press D, then click the box. The box and its phrase link are removed. The caption text remains.

To unlink a phrase: Press D, then click the highlighted phrase in the caption. The highlight and link are removed, but the box and caption text remain.

Deletions Cannot Be Undone

Deleted boxes and phrase links cannot be recovered. Export your dataset regularly as a backup.


Chain-of-thought reasoning

You can train your model to reason step-by-step by placing the reasoning inside <datature_think> tags before the caption text in your annotations. During training, Datature Vi converts these to the model's native <think> tags.

For a phrase grounding caption identifying a cracked component:

<datature_think>The surface has a visible line running diagonally from the top-left.
It extends roughly 3 cm based on the scale reference.
The surrounding area shows slight discoloration, which suggests stress damage.</datature_think>

cracked mounting bracket with stress fracture

The text inside <datature_think>...</datature_think> becomes the model's internal reasoning. The text after the closing tag is the caption that gets linked to the bounding box.
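To illustrate how the two parts separate, here is a small sketch that splits annotation text into reasoning and caption and renames the tags. The function names are hypothetical, and the tag rename is only an illustration of the idea; Datature Vi performs the actual conversion during training.

```python
import re

# Matches one <datature_think>...</datature_think> block, including newlines.
THINK_BLOCK = re.compile(r"<datature_think>(.*?)</datature_think>", re.DOTALL)


def split_reasoning(text):
    """Split annotation text into (reasoning, caption).

    Returns an empty reasoning string when no tag pair is present.
    """
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    caption = text[match.end():].strip()
    return reasoning, caption


def to_native_tags(text):
    """Rename the tags to <think>...</think> (illustration only)."""
    return THINK_BLOCK.sub(lambda m: f"<think>{m.group(1)}</think>", text)


sample = (
    "<datature_think>The surface has a visible line running diagonally "
    "from the top-left.</datature_think>\n\n"
    "cracked mounting bracket with stress fracture"
)
reasoning, caption = split_reasoning(sample)
```

With the sample above, `caption` holds only "cracked mounting bracket with stress fracture", which is the text that gets linked to the bounding box.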

See Chain-of-Thought Reasoning and Annotation Guide for the format.


Next steps

Train A Model

Use your phrase grounding annotations to fine-tune a vision-language model.

AI-Assisted Tools

Learn more about IntelliScribe Caption and IntelliScribe Phrases.

Dataset Overview

Check annotation coverage and quality across your dataset.