Phrase Grounding

Learn how phrase grounding localizes objects in images using natural language descriptions instead of pre-defined categories.

Phrase grounding is a computer vision task that localizes specific objects or regions in an image based on natural language descriptions. Given an image and a text phrase, the system identifies and returns the spatial location of the referenced object as a bounding box.

Instead of training a model to detect pre-defined categories like "car" or "person," you describe what you want ("the red car on the left" or "the person wearing a blue jacket") and the model finds it.

New to Datature Vi?

Datature Vi lets you train a custom VLM for phrase grounding on your own images. Learn what Datature Vi does or follow the quickstart.

Best for
  • Flexible object detection where categories vary or aren't pre-defined
  • Describing objects by attributes, relationships, or position
  • Interactive systems that respond to user-provided descriptions
  • Robotics, image editing, autonomous vehicles, and warehouse automation

Not for
  • High-volume detection of a fixed, known set of object classes
  • Tasks where bounding boxes aren't needed (use visual question answering instead)
  • Real-time detection with strict latency requirements

By the end of this guide

Understand how phrase grounding returns bounding boxes for natural language descriptions, enabling flexible object localization without pre-defined categories.


How phrase grounding works

A VLM processes both the image and your text description together through four stages:

1. Image encoding

   A visual encoder (a ResNet or Vision Transformer) extracts features at multiple scales, from edges and textures up to object types and scenes.

2. Text encoding

   A language model converts your description into embeddings that capture meaning, word relationships, and attributes like color, size, and position.

3. Cross-modal fusion

   Attention mechanisms align text descriptions with image regions, learning which words match which visual areas.

4. Localization

   The model predicts a bounding box around the target object. Datature Vi's VLMs handle complex expressions with negations, relationships, and multiple attributes.
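
Cross-modal fusion is the step where grounding actually happens, so it is worth seeing concretely. The sketch below is a toy illustration of scaled dot-product cross-attention using random vectors (plain NumPy, not Datature Vi's actual architecture): each text token scores every image region, and the softmax weights indicate which regions each word attends to.

import numpy as np

# Toy cross-modal fusion: text tokens attend over image-region features.
# Random vectors stand in for real encoder outputs.
rng = np.random.default_rng(0)
d = 64                                        # shared embedding dimension
text_tokens = rng.normal(size=(5, d))         # e.g. "the red mug on the left"
image_regions = rng.normal(size=(49, d))      # e.g. a 7x7 grid of patch features

# Scaled dot-product scores: how strongly each word matches each region.
scores = text_tokens @ image_regions.T / np.sqrt(d)            # shape (5, 49)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Each token becomes a weighted mixture of the regions it attends to;
# a localization head (stage 4) would predict a box from these fused features.
fused = weights @ image_regions                                # shape (5, 64)
print(weights.shape, fused.shape)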


Common use cases

Robotics

Direct robots with natural language: "Pick up the red mug on the left" or "Move the largest box to the shelf."

Augmented reality

Overlay information on objects identified by description: "Show specs for the machine in the center."

Image editing

Select regions with text instead of manual tools: "Mask the car in the background."

Autonomous vehicles

Localize traffic elements by description: "The stop sign partially hidden by the tree."

Medical imaging

Connect radiology descriptions to image regions: "The lesion in the upper right lobe."

Retail and inventory

Find products by description: "The damaged package in the corner" or "All products with red labels."


Phrase grounding vs. other vision tasks

Task | Input | Output | Best for
--- | --- | --- | ---
Phrase Grounding | Image + text description | Bounding box for that object | Finding objects described in natural language
Visual Question Answering | Image + question | Text answer | Getting information about images conversationally
Object Detection | Image only | Boxes for pre-defined classes | Detecting known object categories
Image Classification | Image only | Single category label | Categorizing entire images
Dense Captioning | Image only | Descriptions of image regions | Generating captions for all regions

Use phrase grounding to locate specific objects. Use visual question answering to understand or analyze image content.


Phrase grounding vs. traditional object detection

Traditional object detection finds instances of the pre-defined classes you specified during training (such as "car" or "person"). It works well for consistent, well-defined categories and high-volume detection of known objects.

Phrase grounding localizes objects based on flexible text descriptions without pre-defined categories. It suits flexible, changing requirements, objects described by attributes or relationships, and interactive or conversational systems.

Example: Traditional detection finds every "car" in the image. Phrase grounding finds "the scratched car on the left conveyor belt."


Tips for better phrase grounding results

Practice | Good example | Avoid
--- | --- | ---
Be specific with attributes | "The red mug on the left side of the table" | "The mug"
Use spatial relationships | "The box next to the door" or "the rightmost shelf" | "That box over there"
Combine multiple attributes | "The small red box on the top shelf" | "The red box"
Stay consistent with terms | Always use "damaged" across the dataset | Mixing "damaged", "defective", and "broken"

Provide diverse training examples. Vary object types, phrasings, attributes, scene complexity, lighting, and viewpoints. More diversity produces better generalization. See dataset preparation and annotation best practices.


Frequently asked questions

How much training data do I need?

For good model performance, aim for:

  • Minimum: 100 phrase-box pairs across at least 20 images
  • Recommended: 500+ pairs across 100+ images
  • Production use: 1,000+ pairs

More diverse examples with varied phrases and object types lead to better performance on new descriptions.

Learn how to upload annotations →

Can I use phrase grounding without training a model?

Yes. Vision-language models support zero-shot grounding: you describe objects without providing any training examples. Zero-shot works well for quick experiments, common objects with standard descriptions, and scenarios where you don't have training data.

For specialized domains or higher accuracy, fine-tuning with your own dataset is recommended.

Learn how to run zero-shot inference →

How is phrase grounding different from visual question answering?

Phrase grounding localizes objects with bounding boxes based on descriptions.

  • Input: Image + description ("the red car")
  • Output: Bounding box coordinates

Visual question answering answers questions about images with text.

  • Input: Image + question ("What color is the car?")
  • Output: Text answer ("red")

Use phrase grounding to find things, VQA to understand things.

Learn more about VQA →

Can I locate multiple objects in one image?

Yes. You can provide multiple phrases to locate multiple objects in the same image during inference. Each phrase gets its own bounding box, so you can locate everything you need in one pass.

Learn about handling multiple results →
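
If you prefer to issue one call per phrase, a minimal sketch looks like the following (it assumes a ViModel loaded as in the inference example later on this page; see the linked guide for options that return several boxes in a single call):

# Sketch: ground several phrases against one image, one call per phrase.
# Assumes `model` is a ViModel instance, as in the inference example below.
phrases = [
    "the red box on the third shelf",
    "the damaged package in the corner",
]

boxes = {}
for phrase in phrases:
    result, error = model(source="warehouse.jpg", user_prompt=phrase, stream=False)
    if error is None:
        boxes[phrase] = result   # bounding box(es) in [0, 1024] coordinates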

Can the model handle complex descriptions?

Modern vision-language models handle complex expressions including:

  • Negations: "The dog that is NOT wearing a collar"
  • Relationships: "The cup next to the laptop"
  • Multiple attributes: "The large red car parked behind the building"
  • Conditions: "The damaged box on the bottom shelf"

Can I combine phrase grounding with other task types?

Yes. You can create multiple datasets with different task types and use them together. For example: use phrase grounding to locate specific products, then use VQA to answer questions about those products. Train separate models for each task and combine them in your application.

Learn about VQA →


How phrase grounding works in Datature Vi

The workflow for phrase grounding in Datature Vi follows four steps: create a phrase grounding dataset, upload images, annotate with text-linked bounding boxes, and train.

After training, run inference with the Vi SDK to get bounding boxes for your text descriptions.

from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    secret_key="your-secret-key",
    organization_id="your-organization-id",
)

result, error = model(
    source="warehouse.jpg",
    user_prompt="Identify the red box on the third shelf",
    stream=False,
)
# result contains bounding boxes in normalized [0, 1024] coordinates
# Convert to pixels: pixel_x = (x / 1024) * image_width

For full inference options, see the Vi SDK inference docs.

Datature Vi inference returns bounding boxes in a normalized coordinate system where both dimensions are scaled to [0, 1024], regardless of the actual image size. To convert to pixel coordinates:

pixel_x = (x / 1024) * image_width
pixel_y = (y / 1024) * image_height

Example: a box at [256, 512, 768, 900] on a 1920x1080 image converts to [480, 540, 1440, 949] pixels.
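
A minimal helper for this conversion, assuming boxes arrive in [x1, y1, x2, y2] corner order (see Bounding box format for the exact layout):

def to_pixels(box, image_width, image_height):
    """Convert a [x1, y1, x2, y2] box from normalized [0, 1024]
    coordinates to pixel coordinates (corner order assumed)."""
    x1, y1, x2, y2 = box
    return [
        round(x1 / 1024 * image_width),
        round(y1 / 1024 * image_height),
        round(x2 / 1024 * image_width),
        round(y2 / 1024 * image_height),
    ]

print(to_pixels([256, 512, 768, 900], 1920, 1080))  # [480, 540, 1440, 949]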

Fine-tuning uses the same token-level cross-entropy loss as other VLM outputs: the supervised sequence contains the language tokens plus the box corners, written in the same [0, 1024] text format the model emits at inference. See How Does VLM Training Work? (the accordion "How does phrase grounding training connect boxes to the loss?") and Bounding box format.
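
To make that concrete, here is a hypothetical sketch of a supervised target string; the exact template and separators are defined by Datature Vi, so treat the format below as illustrative only:

# Hypothetical sketch: box corners serialized as text so the ordinary
# token-level cross-entropy loss supervises words and coordinates alike.
# The actual template is documented in "Bounding box format".
phrase = "the red box on the third shelf"
box = [256, 512, 768, 900]                  # already normalized to [0, 1024]

target_text = f"{phrase}: [{box[0]}, {box[1]}, {box[2]}, {box[3]}]"
print(target_text)   # the model is trained to emit these tokens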

For more on how phrase grounding connects to other Datature Vi concepts, see the Visual Grounding glossary entry.


Related resources

Annotate For Phrase Grounding

Learn how to create phrase-box annotation pairs.

Train A VLM

Fine-tune a model on your phrase grounding dataset.

Visual Question Answering

Learn about the other core dataset type in Datature Vi.