Annotate For Phrase Grounding
Create phrase grounding annotations by linking text descriptions to bounding boxes in your images.
Phrase grounding annotations teach your VLM to understand the connection between text descriptions and specific objects in images. Create comprehensive visual grounding datasets using Datature's annotation tools.
Prerequisites
- A dataset with uploaded images
- Understanding of what objects you want to detect and describe
- Basic familiarity with phrase grounding concepts
Create a dataset if you don't have one yet.
What is phrase grounding?
Phrase grounding connects natural language descriptions to specific regions in an image, for example linking the phrase "large black FPGA chip in the center" to its bounding box location. This enables models to understand and locate objects based on text descriptions.
Open the annotator
1. Navigate to your dataset and click the Annotator tab
2. The annotator will open with empty views
3. Click on an image thumbnail at the bottom to load it
Annotator interface
The phrase grounding annotator has three main areas:
- Left side — Image Annotator with drawing tools (displays your selected image)
- Right side — Phrase Grounding panel with caption and IntelliScribe options
- Bottom — Image thumbnail navigation strip (click any thumbnail to load that image)
The IntelliScribe prompt helps with AI-assisted annotation but can be dismissed if you prefer manual annotation.
Create phrase grounding annotations
Phrase grounding annotations link text descriptions (phrases) to visual regions (bounding boxes). Each annotation has two components:
- Caption — A text description of what's in the image
- Linked bounding boxes — Boxes drawn around objects, connected to specific phrases in the caption
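Conceptually, each phrase-box link is a character span in the caption paired with box coordinates. The sketch below is purely illustrative (the field names and pixel coordinates are made up and do not reflect Datature's export schema), but it shows the information a single annotation carries:

```python
# Illustrative sketch of a phrase grounding record (hypothetical field names).
caption = (
    "A black FPGA development board featuring a large black FPGA chip "
    "in the center and a row of gold pin headers along the top"
)

def phrase_span(caption: str, phrase: str) -> tuple[int, int]:
    """Return the (start, end) character offsets of `phrase` inside `caption`."""
    start = caption.index(phrase)
    return start, start + len(phrase)

annotation = {
    "image": "fpga_board_001.jpg",
    "caption": caption,
    "groundings": [
        {
            "phrase": "large black FPGA chip",
            "phrase_span": phrase_span(caption, "large black FPGA chip"),
            # Linked bounding box in pixel coordinates: (x_min, y_min, x_max, y_max)
            "box": (412, 318, 688, 590),
        },
        {
            "phrase": "row of gold pin headers",
            "phrase_span": phrase_span(caption, "row of gold pin headers"),
            "box": (120, 40, 930, 95),
        },
    ],
}
```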
Understand the interface
Before annotating, familiarize yourself with the available tools:
| Tool | Shortcut | Purpose |
|---|---|---|
| Rectangle | R | Draw bounding boxes around objects |
| Text | T | Write or edit the image caption |
| Highlight | H | Highlight phrases in the caption to link to boxes |
| Delete | D | Remove bounding boxes or phrase links |
| IntelliScribe Caption | C | AI-generate a caption based on the image |
| IntelliScribe Phrases | P | AI-highlight phrases for unlinked boxes |
| Esc | Esc | Exit current tool mode |
Learn about AI-assisted tools →
Write or generate captions
Every phrase grounding annotation starts with a caption that describes the image content.
Option 1: Write caption manually
1. Press T or click the Text tool
2. Type your caption in the Phrase Grounding panel on the right
3. Describe the main objects, their attributes, and relationships
Example captions:
- "A black FPGA development board featuring a large black FPGA chip in the center, a row of gold pin headers along the top, a 7-segment LED display on the right showing digits"
- "Two red apples and three green apples on a wooden table"
- "A person wearing a blue safety helmet standing next to industrial machinery"
Option 2: Generate caption with AI
1. Press C or click the IntelliScribe Caption button
2. Review the AI-generated caption in the Phrase Grounding panel
3. Edit as needed using the Text tool (press T)
Learn more about IntelliScribe →
Caption writing best practices
- Be descriptive — Include colors, sizes, positions, and other distinguishing features
- Be specific — Use precise terms for objects ("FPGA chip" vs. "chip")
- Include spatial relationships — Mention locations ("on the left", "in the center", "at the top")
- Keep it natural — Write complete sentences as you would describe the image to someone
See Annotation best practices for detailed guidelines.
Link phrases to bounding boxes
After writing a caption, link specific phrases to bounding boxes around the objects they describe. There are two main workflows:
Workflow 1: Caption first, then boxes
This workflow is ideal when you know what you want to describe before drawing boxes.
[GIF PLACEHOLDER: Type caption → highlight phrase → draw linked bounding box]
1. Write your caption using the Text tool (press T)
2. Highlight a phrase:
   - Press H or click the Highlight tool
   - Click and drag to select a phrase in your caption
   - The phrase becomes highlighted with a colored background
3. Draw a bounding box:
   - You'll be automatically prompted to draw a box
   - Press R or click the Rectangle tool
   - Click and drag on the image to draw a box around the object
   - The box will be linked to your highlighted phrase with a matching color
4. Repeat for other phrases:
   - Press H to highlight another phrase
   - Draw its corresponding bounding box
   - Continue until all key objects are annotated
Tip: Each phrase-box pair gets a unique color to help you quickly see which phrases link to which boxes.
Highlighted phrases cannot overlap
Each highlighted phrase must be a distinct, non-overlapping text interval in your caption. You cannot highlight overlapping text regions. Plan your caption structure to ensure phrases are clearly separated.
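If you pre-process captions in your own scripts, the non-overlap rule reduces to a simple interval check on character spans. A minimal sketch of that logic (not part of the annotator, just the constraint it enforces):

```python
def spans_overlap(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True if two (start, end) character spans share any characters."""
    return a[0] < b[1] and b[0] < a[1]

def all_spans_disjoint(spans: list[tuple[int, int]]) -> bool:
    """Check that every highlighted phrase occupies a distinct text interval."""
    ordered = sorted(spans)
    return all(prev[1] <= curr[0] for prev, curr in zip(ordered, ordered[1:]))

# Example offsets (illustrative): spans for "black FPGA chip" and
# "FPGA chip in the center" overlap, so both cannot be highlighted.
assert spans_overlap((8, 23), (14, 37))
assert not all_spans_disjoint([(8, 23), (14, 37)])
```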
Workflow 2: Boxes first, then link phrases
This workflow is ideal when you want to identify objects visually before describing them.
[GIF PLACEHOLDER: Draw bounding box → link by highlighting phrase in caption]
1. Draw a bounding box:
   - Press R or click the Rectangle tool
   - Click and drag on the image to draw a box around an object
   - The box appears unlinked (no associated phrase yet)
2. Link to a phrase:
   - You'll be automatically prompted to select a phrase
   - Press H or click the Highlight tool
   - Highlight the phrase in your caption that describes this object
   - The box and phrase become linked with a matching color
3. Repeat for other objects:
   - Press R to draw another box
   - Highlight its corresponding phrase
   - Continue until all key objects are annotated
Alternative: Use IntelliScribe to auto-link phrases
If you have unlinked boxes:
- Press P or click IntelliScribe Phrases
- The AI will automatically highlight phrases in your caption that match your boxes
- Review and adjust the highlighted phrases as needed
Prerequisites: You must have at least one bounding box drawn and a caption written.
Edit or delete annotations
You can modify or remove annotations at any time. Choose the appropriate action based on what you need to change.
Edit a caption
1. Press T or click the Text tool
2. Click in the caption text area
3. Make your changes
4. Changes save automatically as you type
Use when: Fixing typos, adding details, rephrasing descriptions
Edit a bounding box
1. Click on the bounding box you want to adjust
2. Drag the box to reposition it
3. Drag the corner or edge handles to resize it
4. The linked phrase remains connected
Use when: Adjusting box position or size while keeping the same phrase link
Delete a bounding box
1. Press D or click the Delete tool
2. Click on the bounding box you want to remove
3. The box and its phrase link are deleted (the phrase text remains in the caption)
Use when: Removing incorrect boxes, duplicate annotations, or boxes on wrong objects
Deletions cannot be undone. Make sure you want to remove the annotation before clicking.
Delete a phrase link
1. Press D or click the Delete tool
2. Click on the highlighted phrase in the caption
3. The highlight and box link are removed (the box and caption text remain)
Use when: Unlinking a phrase to connect it to a different box, or removing phrase highlights while keeping boxes
Deletions cannot be undone. You can re-highlight the phrase to link it again.
Quick reference:
- Edit — Adjust positions, fix typos, or rephrase descriptions
- Delete — Remove incorrect boxes, unlink phrases, or start over
Save your work
Your annotations save automatically when you:
- Draw or edit bounding boxes
- Write or edit captions
- Highlight phrases
- Delete annotations
- Navigate to another image (press E or Q)
Auto-save enabled
All changes save automatically. You don't need to manually save your work. Use keyboard shortcuts to navigate efficiently through images.
Navigate between images
Efficient navigation helps you annotate large datasets quickly. Use keyboard shortcuts or mouse controls to move between images.
Move to the next image
After annotating an image, move to the next one:
Using keyboard (recommended for speed):
- Press E → Next image
- Press Q → Previous image
Using mouse:
- Click the next image thumbnail in the bottom navigation strip
- Click on any specific image thumbnail to jump directly to it
The bottom thumbnail strip shows all images in your dataset with "Annotated" badges indicating which images have phrase grounding annotations.
Pro tip: Use keyboard shortcuts to maintain annotation speed without switching between keyboard and mouse.
Keyboard shortcuts
Speed up annotation with keyboard shortcuts. Master these to significantly improve your annotation efficiency.
Essential shortcuts
| Shortcut | Action | Use case |
|---|---|---|
| R | Rectangle tool | Draw bounding boxes around objects |
| T | Text tool | Write or edit the image caption |
| H | Highlight tool | Highlight phrases to link with boxes |
| D | Delete tool | Remove boxes or phrase links |
| C | IntelliScribe Caption | AI-generate a caption |
| P | IntelliScribe Phrases | AI-highlight phrases for unlinked boxes |
| E | Next image | Move to the next image after annotating |
| Q | Previous image | Go back to review previous annotations |
| Esc | Exit tool mode | Return to default selection mode |
| ? | Show all shortcuts | View complete keyboard shortcut reference |
Workflow optimization
Choose a workflow pattern that suits your annotation style. All workflows keep your hands on the keyboard for maximum speed.
Caption-first workflow
Efficient annotation pattern:
- Press T → Type caption
- Press H → Highlight phrase → Press R → Draw box
- Repeat step 2 for other phrases
- Press E to move to next image
Best for: Beginners, systematic annotation, descriptive-heavy tasks
Box-first workflow
Alternative pattern:
- Press R → Draw box → Press H → Highlight phrase
- Repeat for other objects
- Press E to move to next image
Best for: Experienced annotators, visual identification, speed
AI-assisted workflow
Quick AI-assisted pattern:
- Press C → Review generated caption → Press T to edit if needed
- Press R → Draw boxes around objects
- Press P → Review AI-highlighted phrases → Adjust if needed
- Press E to move to next image
Best for: Large datasets, common objects, rapid annotation
See Annotation best practices for more efficiency tips.
Pro tip
Press ? in the annotator to see the complete list of keyboard shortcuts. Press Esc to exit any tool mode and return to default selection.
Annotation guidelines
Creating effective phrase grounding annotations requires attention to both visual accuracy and text quality. Follow these guidelines for consistent, high-quality annotations.
Caption writing guidelines
Be descriptive and specific:
✅ Good:
- "A large black FPGA chip in the center, several white tactile push buttons on the left side, a blue VGA connector at the bottom"
- "Two red apples in a wooden bowl on the left, three green apples scattered on the right side of the table"
- "A worker wearing a yellow safety helmet and orange vest standing next to a forklift"
❌ Too vague:
- "Some electronics"
- "Fruits on a table"
- "A person at work"
Include spatial information:
Use positional terms to help the model understand object locations:
- Position: "in the center", "on the left edge", "at the top", "near the bottom"
- Relative position: "above the", "next to", "between the", "surrounding"
- Direction: "facing left", "pointing upward", "horizontal", "vertical"
Use consistent terminology:
- Choose specific object names and use them consistently across your dataset
- Define a vocabulary for your domain (e.g., "FPGA chip" vs. "integrated circuit")
- Standardize color descriptions (e.g., "blue" not "light blue" or "navy")
- Use the same format for similar objects
Bounding box guidelines
Draw tight boxes:
- Include the entire object but minimize empty space
- Align box edges with object boundaries
- For irregular objects, include the smallest rectangle that contains the whole object
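For context, the "smallest rectangle" is simply the axis-aligned box spanning the object's extreme points. A quick sketch, assuming you have the object outline as (x, y) pixel coordinates (the numbers below are made up):

```python
def tight_box(points: list[tuple[float, float]]) -> tuple[float, float, float, float]:
    """Smallest axis-aligned box (x_min, y_min, x_max, y_max) containing all points."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)

# Outline of an irregular object (illustrative coordinates only).
outline = [(104, 220), (180, 190), (230, 260), (150, 310)]
print(tight_box(outline))  # (104, 190, 230, 310)
```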
Be consistent:
- Draw boxes at the same level of detail across your dataset
- If you include shadows or reflections for one object, do so for all similar objects
- Maintain consistent box tightness (not too tight on some, loose on others)
Handle overlapping objects:
- Draw separate boxes for each object, even if they overlap
- Ensure each box clearly defines one object's boundaries
- Link each box to its own phrase in the caption
Handle partially visible objects:
- Draw boxes around visible portions of partially occluded objects
- Include a phrase like "partially visible" in the caption if relevant
- Be consistent about whether you annotate partially visible objects
Phrase linking guidelines
Phrases must not overlap:
Highlighted phrases must be distinct, non-overlapping text intervals in your caption. You cannot highlight the same words multiple times or create overlapping highlights.
✅ Good:
- "A large black FPGA chip in the center" and "several white tactile buttons on the left" (separate, non-overlapping phrases)
- "red apple" and "blue bowl" (completely distinct text regions)
❌ Invalid:
- Highlighting "black FPGA chip" and then trying to highlight "FPGA chip in the center" (overlapping words)
- Highlighting "red apple" and then highlighting "apple" again (duplicate highlight of same word)
Best practice: Write your caption with clearly separated descriptive phrases for each object you plan to annotate.
One phrase per box:
Each bounding box should link to exactly one phrase in the caption. Break complex descriptions into multiple linked annotations if needed.
✅ Good:
- Phrase: "large black FPGA chip" → Box: around the FPGA chip
- Phrase: "row of gold pin headers" → Box: around the pin headers
- Phrase: "7-segment LED display" → Box: around the LED display
❌ Too complex:
- Phrase: "large black FPGA chip with pin headers and an LED display" → Multiple boxes
- Better: Split into three separate phrase-box pairs
Match phrase specificity to box:
- If your phrase mentions color, size, or location, ensure these are accurate for the box
- Don't use generic phrases like "object" or "thing" unless unavoidable
- Include distinguishing features that help identify the specific object
Link all key objects:
- Annotate all prominent objects in the image
- Don't leave obvious objects without phrase-box links
- For complex scenes, prioritize the most important 5-10 objects
Balance detail with practicality:
- For simple scenes: Link most visible objects (5-8 phrase-box pairs)
- For complex scenes: Focus on key objects and representative examples (8-12 pairs)
- Avoid exhaustive annotation of every tiny detail
How many annotations per image?
- Minimum: 3-5 phrase-box pairs per image
- Recommended: 5-10 phrase-box pairs per image
- Complex scenes: 8-15 phrase-box pairs per image
More annotations = better model understanding, but quality matters more than quantity.
Troubleshooting
Common issues and solutions when creating phrase grounding annotations.
My annotations aren't saving
Possible causes and solutions:
1. Network connectivity — Check your internet connection. Annotations require an active connection to save.
2. Session timeout — If you've been idle, refresh the page and reopen the annotator.
3. Browser issues — Clear cache or try a different browser. Supported browsers: Chrome, Firefox, Safari, Edge.
4. Incomplete annotation — Ensure boxes are properly linked to phrases. Unlinked boxes may not save correctly.
Verify your annotations saved:
- Check your dataset Explorer Annotations tab for statistics
- Review the bottom thumbnail strip in the annotator for "Annotated" badges
I can't draw bounding boxes
Solution steps:
1. Activate Rectangle tool — Press R or click the Rectangle tool icon
2. Check tool mode — If another tool is active (H, T, D), press Esc to exit, then press R
3. Zoom level — If the image is too zoomed in or out, adjust the zoom to see the area clearly
4. Browser permissions — Ensure your browser allows drawing on canvas elements
The interface is slow or lagging
Performance optimization:
1. Large images — Images over 10 MB may load slowly. Upload optimized images when possible.
2. Many annotations — Images with 20+ boxes may be slower. Consider splitting complex scenes across multiple images.
3. Browser resources — Close unnecessary tabs and applications.
4. Use keyboard shortcuts — Keyboard navigation is faster than mouse clicks.
How do I unlink a phrase from a box?
Solution:
1. Press D or click the Delete tool
2. Click on the highlighted phrase in the caption (not the box)
3. The phrase highlight and link are removed, but the bounding box remains
4. You can now highlight a different phrase to link to that box
Alternatively, delete the box and redraw it linked to the correct phrase.
IntelliScribe isn't working
Common issues:
1. IntelliScribe Caption (C) — Works on any image. If it fails, check your internet connection.
2. IntelliScribe Phrases (P) — Requires at least one bounding box drawn and a caption with text. If it still doesn't work, refresh the page.
I accidentally deleted annotations
Unfortunately, deletions cannot be undone. Best practices to prevent this:
1. Be careful with the Delete tool — Press D intentionally, not by accident
2. Exit Delete mode — Press Esc after deleting to prevent accidental deletions
3. Export backups — Regularly download your dataset as a backup
4. Review before navigating — Check your work before pressing E to move to the next image
Common questions
Should I write captions before or after drawing boxes?
Both approaches work:
Caption first (recommended for beginners):
- Write complete caption describing all objects
- Highlight phrases systematically
- Draw boxes for each highlighted phrase
- More organized and less likely to miss objects
Boxes first (faster for experienced annotators):
- Identify objects visually
- Draw boxes quickly
- Link to phrases in caption
- Use IntelliScribe Phrases to auto-highlight
Choose based on your preference and workflow. See Link phrases to bounding boxes for both workflows.
How detailed should my captions be?
Balance detail with practicality:
Include:
- Main objects and their attributes (color, size, shape)
- Spatial relationships and positions
- Distinguishing features
- Counts when relevant ("three apples", "several buttons")
Avoid:
- Excessive detail about unimportant elements
- Subjective opinions ("beautiful", "interesting")
- Information not visible in the image
- Redundant descriptions
Good length: 10-30 words for simple scenes, 20-50 words for complex scenes
See Caption writing guidelines for examples.
Can I annotate the same object multiple times?
Generally no, but there are exceptions:
Standard practice:
- Each object gets one bounding box
- Each box links to one phrase
- No duplicate boxes on the same object
Exceptions:
- Object has multiple distinct parts worth annotating separately
- Object appears in different contexts requiring different descriptions
- Dataset requires multiple annotation styles for comparison
Be consistent with your approach across the entire dataset.
What if objects overlap?
Draw separate boxes for each object:
- Draw a box around the first object (even if partially occluded)
- Draw a box around the second object
- Boxes can overlap in the interface
- Link each box to its own phrase in the caption
Example:
- Caption: "A red apple behind a blue bowl on the left, a green apple visible in the bowl"
- Box 1: Around visible part of red apple → "red apple behind a blue bowl"
- Box 2: Around blue bowl → "blue bowl on the left"
- Box 3: Around green apple → "green apple visible in the bowl"
Can I annotate images collaboratively?
Yes! Multiple team members can annotate different images in the same dataset:
- Add team members to your organization
- Assign different images to different annotators
- Create annotation guidelines based on this documentation
- Reference Annotation guidelines for consistency
Pro tip: Create a style guide for your team covering:
- Vocabulary and terminology to use
- How tight to draw boxes
- How detailed captions should be
- Which objects to prioritize
Can I import existing phrase grounding annotations?
Yes! You can upload pre-existing phrase grounding annotations:
- Upload phrase grounding annotations in supported formats
- This is useful if you have annotations from other tools or datasets
- You can then edit these annotations in the annotator
After importing, check your dataset Explorer to verify your uploaded annotations.
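If you are converting annotations from another tool, a quick offline sanity check can catch broken phrase offsets or degenerate boxes before upload. This is a hedged sketch that assumes the simple record layout shown earlier in this guide, not Datature's actual import schema, so adapt the field names to the format you actually export:

```python
def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one phrase grounding record."""
    problems = []
    caption = record["caption"]
    for g in record["groundings"]:
        start, end = g["phrase_span"]
        # Phrase span must be a valid, non-empty slice of the caption.
        if not (0 <= start < end <= len(caption)):
            problems.append(f"span ({start}, {end}) falls outside the caption")
        # If the record stores the phrase text, it should match the slice.
        elif "phrase" in g and caption[start:end] != g["phrase"]:
            problems.append(f"span ({start}, {end}) does not match its phrase text")
        # Boxes must have positive width and height.
        x_min, y_min, x_max, y_max = g["box"]
        if x_min >= x_max or y_min >= y_max:
            problems.append(f"box {g['box']} has non-positive width or height")
    return problems
```

Run a check like this over every record and fix anything it flags before uploading.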
Should I use IntelliScribe or annotate manually?
Use IntelliScribe to speed up annotation, but always review:
IntelliScribe strengths:
- Fast caption generation
- Good for common objects
- Useful for getting started quickly
- Helpful for large datasets
When to annotate manually:
- Domain-specific objects
- Precise terminology required
- Complex relationships between objects
- Quality over speed
Best approach: Combine both:
- Use IntelliScribe Caption (press C) to generate initial caption
- Edit caption manually (press T) to add domain-specific terms
- Draw boxes manually (press R) for precision
- Use IntelliScribe Phrases (press P) to auto-link phrases
- Review and adjust links as needed
Annotation best practices
Follow these guidelines to create high-quality phrase grounding annotations that improve model performance.
Consistency is key
- Use the same terminology for similar objects across images
- Draw boxes at consistent tightness across your dataset
- Follow the same caption writing guidelines for all images
- Create a style guide if annotating collaboratively
Quality over quantity
- Accurate boxes are more valuable than many imprecise ones
- Take time to write descriptive captions
- Review and edit annotations when you spot errors
- Check the bottom thumbnail strip to review which images have been annotated
Diverse coverage
- Annotate objects at different scales (small, medium, large)
- Include objects in different positions and contexts
- Cover different lighting conditions and angles
- Include partially visible objects when relevant
Efficient workflow
- Learn keyboard shortcuts to speed up annotation
- Choose a workflow (caption-first vs. box-first) and stick with it
- Use IntelliScribe to speed up initial annotation
- Monitor your progress through the dataset Explorer Annotations tab
- Take breaks to maintain annotation quality
What's next?
Annotations complete!
Your phrase grounding annotations are ready for training. Your model will learn to understand the connection between text descriptions and visual objects.
Review Annotation best practices to ensure your dataset is optimized for training.
Next steps
- Use your annotations to fine-tune a vision-language model
- Import existing phrase grounding annotations
- Understand how phrase grounding works and best practices
- Learn more about IntelliScribe features
Related resources
- Annotate for VQA — Visual question answering annotations
- Upload annotations — Import existing phrase grounding annotations
- Phrase grounding concepts — Deep dive into phrase grounding
- View dataset insights — Analyze your annotation quality
- Manage datasets — Organize and maintain your datasets
- AI-assisted tools — Learn about IntelliScribe features
- Add team members — Collaborate on annotation projects
Need help?
We're here to support your VLMOps journey. Reach out to our team if you have any questions.