Phrase Grounding

Learn how Phrase Grounding localizes objects in images using natural language descriptions instead of pre-defined categories.

Phrase grounding, also known as visual grounding or referring expression comprehension, is a computer vision task that localizes specific objects or regions in an image based on natural language descriptions. Given an image and a text phrase, the system identifies and returns the spatial location of the referenced object as a bounding box.

Instead of training a model to detect pre-defined categories like "car" or "person," you can describe what you're looking for in natural language—"the red car on the left" or "the person wearing a blue jacket"—and the model finds it.

📘

When to use Phrase Grounding

Use phrase grounding when you need flexible object detection with natural language descriptions. Perfect for robotics, image editing, and scenarios where object categories change frequently. See how it compares to traditional object detection below.


How it works

Phrase grounding systems combine computer vision and natural language processing to connect words with image regions. Vision-language models process both modalities simultaneously to understand and locate objects.

Image encoding

The system extracts visual features from the input image using backbone networks (such as ResNet or Vision Transformers). This process:

  • Captures visual information at multiple scales
  • Identifies low-level details (edges, textures, colors)
  • Recognizes high-level semantic information (object types, scenes)
  • Creates a rich representation of everything in the image

Think of this as the VLM building a detailed visual understanding of the entire scene.
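
For intuition, here is a minimal sketch of multi-scale feature extraction using torchvision's ResNet-50. The backbone choice and layer names are illustrative only, not a description of Datature's internal architecture:

    import torch
    import torchvision.models as models
    from torchvision.models.feature_extraction import create_feature_extractor

    # Illustrative backbone; a Vision Transformer would play the same role.
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

    # Tap three stages of the network to capture features at multiple scales.
    extractor = create_feature_extractor(
        backbone,
        return_nodes={"layer2": "stride8", "layer3": "stride16", "layer4": "stride32"},
    )

    image = torch.randn(1, 3, 640, 640)    # stand-in for a preprocessed image tensor
    features = extractor(image)
    for name, fmap in features.items():
        print(name, tuple(fmap.shape))     # e.g. stride8 -> (1, 512, 80, 80)

The coarse maps carry scene-level semantics while the finer maps preserve the edges, textures, and colors needed for precise localization.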

Text encoding

The referring expression is processed through language models (such as BERT or RoBERTa) to create contextualized embeddings. This step:

  • Understands the semantic meaning of words and phrases
  • Captures relationships between words ("the cup next to the plate")
  • Identifies important attributes (colors, sizes, positions)
  • Interprets context and intent

The model learns what you're asking for and what characteristics matter for object localization.
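
As a minimal sketch, a Hugging Face BERT model can produce the contextualized embeddings described above. The model choice is illustrative; any text encoder that outputs per-token embeddings works the same way:

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Illustrative encoder; RoBERTa or a CLIP text tower would work the same way.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    text_encoder = AutoModel.from_pretrained("bert-base-uncased")

    phrase = "the red mug on the left side of the table"
    tokens = tokenizer(phrase, return_tensors="pt")

    with torch.no_grad():
        out = text_encoder(**tokens)

    # One contextualized embedding per token: "mug" now carries information
    # from "red" and "left side of the table".
    token_embeddings = out.last_hidden_state
    print(token_embeddings.shape)           # (1, number_of_tokens, 768)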

Cross-modal fusion

Visual and linguistic features are aligned through attention mechanisms or transformers. This critical step:

  • Associates text descriptions with image regions
  • Learns which parts of the image match which words
  • Handles complex relationships and attributes
  • Resolves ambiguity when multiple similar objects exist

The vision-language model connects your words to specific visual elements.
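
A minimal cross-attention sketch in plain PyTorch illustrates the idea: the phrase tokens act as queries, the image regions as keys and values. All dimensions here are placeholders:

    import torch
    import torch.nn as nn

    d_model = 256                              # shared embedding width (placeholder)

    # Project each modality into the shared space; input widths are placeholders
    # matching a BERT text encoder (768) and a ResNet-50 backbone (2048).
    text_proj = nn.Linear(768, d_model)
    image_proj = nn.Linear(2048, d_model)
    cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    text_tokens = torch.randn(1, 12, 768)      # stand-in phrase embeddings
    image_patches = torch.randn(1, 400, 2048)  # stand-in 20x20 feature map, flattened

    q = text_proj(text_tokens)                 # queries: the words of the phrase
    kv = image_proj(image_patches)             # keys/values: image regions

    # attn_weights[:, i, j] ~ how strongly word i attends to image patch j
    fused, attn_weights = cross_attn(q, kv, kv)
    print(fused.shape, attn_weights.shape)     # (1, 12, 256) (1, 12, 400)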

Localization

From the fused representation, the model predicts the spatial location of the referred object, outputting:

  • Bounding box coordinates — A rectangular box around the target object
  • Segmentation mask (advanced) — Precise pixel-level outline of the object

The model draws a box around exactly what you described, which you can then visualize or process.
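
To make this concrete, here is a toy box-regression head. Production grounding models use more elaborate decoders, but the input and output shapes follow the same pattern:

    import torch
    import torch.nn as nn

    class BoxHead(nn.Module):
        """Toy localization head: pools the fused features and regresses one box."""

        def __init__(self, d_model: int = 256):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(d_model, d_model),
                nn.ReLU(),
                nn.Linear(d_model, 4),         # (cx, cy, w, h)
            )

        def forward(self, fused: torch.Tensor) -> torch.Tensor:
            pooled = fused.mean(dim=1)         # average over the phrase tokens
            return self.mlp(pooled).sigmoid()  # box in normalized [0, 1] coordinates

    head = BoxHead()
    fused = torch.randn(1, 12, 256)            # output of the fusion step above
    print(head(fused))                         # e.g. tensor([[0.31, 0.62, 0.18, 0.25]])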

Modern vision-language models

Datature's VLMs use self-attention and cross-attention to understand complex expressions with negations, relationships, and multiple attributes—enabling rich, contextual grounding.

Learn how to train your own VLM →


Common use cases

Phrase grounding enables powerful applications across many industries. See how natural language object detection transforms workflows:

🤖 Robotics

Enabling robots to identify and manipulate objects using natural language commands

Instead of programming robots to recognize specific objects, use natural language to direct them flexibly and intuitively.

Example commands:

  • "Pick up the red mug on the left"
  • "Move the largest box to the shelf"
  • "Grab the wrench next to the screwdriver"

This makes robots more adaptable to changing environments and easier to control without reprogramming.

🥽 Augmented reality

Overlaying information on specific real-world objects identified through natural language

Users can point at objects and ask questions or request information using natural descriptions rather than pre-programmed markers.

Example applications:

  • "Show me information about the painting on the right wall"
  • "Highlight the circuit board component labeled A3"
  • "Display specs for the machine in the center"

🎨 Image editing

Selecting and editing image regions using natural language instead of manual tools

Replace tedious manual selection with simple text descriptions to speed up editing workflows.

Example selections:

  • "Select the person in the red shirt"
  • "Highlight all the products on the top shelf"
  • "Mask the car in the background"

This makes image editing more accessible to non-technical users who struggle with traditional selection tools.

🚗 Autonomous vehicles

Localizing traffic elements, pedestrians, or obstacles based on descriptive commands

Help autonomous systems identify and respond to specific objects in complex traffic scenarios.

Example detections:

  • "The pedestrian crossing from the right"
  • "The stop sign partially hidden by the tree"
  • "The vehicle merging from the left lane"

This enables more nuanced understanding of traffic situations beyond simple category detection.

🏥 Medical imaging

Identifying specific anatomical structures or anomalies described in radiology reports

Connect textual medical descriptions directly to image regions without manual annotation.

Example identifications:

  • "The lesion in the upper right lobe"
  • "The fracture near the distal end"
  • "The enlarged lymph node adjacent to the trachea"

🚧

Important

Phrase grounding should assist, not replace, medical professionals. Always have qualified healthcare providers review and validate AI-generated localizations.

📦 Retail and inventory

Locating specific products in warehouse images or store shelves

Find products quickly using natural descriptions rather than maintaining complex product databases.

Example searches:

  • "The blue box on the third shelf"
  • "The damaged package in the corner"
  • "All products with red labels"

This speeds up inventory management, picking operations, and quality control.

🎥 Video analysis

Finding specific objects or people across video frames

Track and locate specific elements in video content using textual descriptions for surveillance or content indexing.

Example tracking:

  • "The person wearing the yellow jacket"
  • "The black sedan that entered from the north entrance"
  • "The package placed on the counter at 3:15 PM"

Enables efficient video search without manually reviewing hours of footage.


Phrase grounding vs other vision tasks

Understanding how phrase grounding differs from other computer vision tasks helps you choose the right approach for your dataset and application:

Task | Input | Output | Best for
Phrase Grounding | Image + text description | Bounding box for that object | Finding objects described in natural language
Visual Question Answering | Image + question | Text answer | Getting information about images conversationally
Object Detection | Image only | Boxes for pre-defined classes | Detecting known object categories
Image Classification | Image only | Single category label | Categorizing entire images
Dense Captioning | Image only | Descriptions of image regions | Generating captions for all regions

💡

Choose Phrase Grounding when:

  • You need to locate objects using natural language descriptions
  • Object categories vary and aren't pre-defined
  • You want to describe objects by attributes or relationships
  • You're building interactive systems that respond to user descriptions

See best practices for writing effective phrases.

Learn more about Visual Question Answering →


Phrase grounding vs traditional object detection

Understanding the key differences helps you choose the right approach for your training project and use case:

Traditional object detection

Fixed categories: Detects pre-defined object classes you specified during training (e.g., "car", "person", "defect").

Best for:

  • Consistent, well-defined categories
  • High-volume detection of known objects
  • When categories don't change

Example: Detecting defects of known types on a production line

Phrase grounding

Natural language: Localizes objects based on flexible text descriptions without pre-defined categories.

Best for:

  • Flexible, changing requirements
  • Describing objects by attributes or relationships
  • Interactive or conversational systems

Example: "Find the scratched product on the left conveyor belt"


Getting started with phrase grounding

Ready to build your own phrase grounding system? Here's your path forward:


Tips for better phrase grounding results

Follow these best practices when creating annotations and training models for phrase grounding:

Write descriptive, specific phrases

Good phrases include distinguishing attributes:

Good: "The red mug on the left side of the table" ❌ Too vague: "The mug"

Good: "The person wearing a blue jacket and glasses" ❌ Ambiguous: "That person"

Include colors, positions, sizes, and relationships to help the VLM distinguish between similar objects.

Learn more about annotation best practices →

Use positional and relational descriptions

Leverage spatial relationships:

  • Absolute positions: "on the left", "in the center", "at the top"
  • Relative positions: "next to the door", "behind the car", "under the table"
  • Ordinal positions: "the first chair", "the rightmost box", "the tallest building"

These help disambiguate when multiple similar objects are present.

Include multiple attributes

Combine multiple characteristics:

Better: "The small red box on the top shelf" ⚠️ Okay: "The red box"

Multiple attributes make it easier for the model to identify the exact target object, especially in cluttered scenes.

Provide diverse training examples

Train your VLM with varied scenarios:

  • Different object types and categories
  • Various phrasings for similar objects
  • Multiple attributes (color, size, position, condition)
  • Complex scenes with many objects
  • Different lighting conditions and viewpoints

More diversity in your training dataset = better generalization to new descriptions and scenes.

Learn about dataset preparation →

Be consistent with terminology

Use consistent language across your dataset:

  • Choose one term and stick with it ("damaged" vs. "defective" vs. "broken")
  • Use consistent color names ("red" not "crimson" or "scarlet")
  • Maintain consistent spatial descriptions ("left" not "left side" or "left-hand side")

Consistency helps the model learn more effectively and produce reliable results.


Common questions

How many phrase-box pairs do I need?

For good model performance, aim for:

  • Minimum: 100 phrase-box pairs across at least 20 images
  • Recommended: 500+ pairs across 100+ images
  • Optimal: 1000+ pairs for production use

More diverse examples with varied phrases and object types lead to better performance on new descriptions. See dataset insights to track your annotation progress.

Learn how to upload annotations →
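
For illustration, a phrase-box pair can be thought of as a record like the one below. The field names are hypothetical; follow the annotation upload guide linked above for the exact format the platform expects:

    # Hypothetical layout of phrase-box pairs; the exact upload format is
    # documented in the annotation guide linked above.
    annotations = [
        {
            "image": "warehouse_001.jpg",
            "phrase": "the blue box on the third shelf",
            "bbox": [412, 128, 596, 240],      # x_min, y_min, x_max, y_max in pixels
        },
        {
            "image": "warehouse_001.jpg",
            "phrase": "the damaged package in the corner",
            "bbox": [18, 502, 141, 610],
        },
    ]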

Can I use phrase grounding without training?

Yes! Vision-language models support zero-shot grounding, where you can describe objects without any training examples.

Zero-shot works well for:

  • Quick experiments and prototypes
  • Common objects with standard descriptions
  • Scenarios where you don't have training data

For specialized domains or higher accuracy, fine-tuning with your own dataset is recommended. Use pre-trained models as a starting point.

Learn how to run zero-shot inference →
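
As one illustration of zero-shot grounding, the sketch below uses the open-source OWL-ViT model from Hugging Face transformers (not Datature's API) to localize two phrases in a local image; the file name is a placeholder:

    import torch
    from PIL import Image
    from transformers import OwlViTForObjectDetection, OwlViTProcessor

    processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
    model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

    image = Image.open("scene.jpg").convert("RGB")   # placeholder file name
    phrases = [["the red mug on the left", "the person wearing a blue jacket"]]

    inputs = processor(text=phrases, images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs, threshold=0.2, target_sizes=target_sizes
    )[0]

    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
        print(f"{phrases[0][int(label)]}: {score.item():.2f} at {box.tolist()}")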

What's the difference between Phrase Grounding and VQA?

Phrase Grounding localizes objects with bounding boxes based on descriptions.

  • Input: Image + description ("the red car")
  • Output: Bounding box coordinates

Visual Question Answering answers questions about images with text.

  • Input: Image + question ("What color is the car?")
  • Output: Text answer ("red")

Use phrase grounding to find things, VQA to understand things.

Learn more about VQA →

Can I ground multiple objects at once?

Yes! You can provide multiple phrases to locate multiple objects in the same image during inference:

  • "The red car" → Box 1
  • "The blue bicycle" → Box 2
  • "The person in yellow" → Box 3

Each phrase gets its own bounding box, allowing you to locate everything you need in one pass. Process the results to extract and visualize all detections.

Learn about handling multiple results →
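
Building on the zero-shot sketch above, a small helper (hypothetical, assuming that result layout) can keep the highest-scoring box per phrase:

    def best_box_per_phrase(phrases, results):
        """Keep only the highest-scoring box for each phrase.

        Assumes the {"scores", "labels", "boxes"} layout returned by the
        zero-shot sketch earlier; other models may need a small adapter.
        """
        best = {}
        for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
            phrase = phrases[int(label)]
            s = float(score)
            if phrase not in best or s > best[phrase][0]:
                best[phrase] = (s, [round(v, 1) for v in box.tolist()])
        return best

    # Example usage with the phrases listed above:
    # best_box_per_phrase(["The red car", "The blue bicycle", "The person in yellow"], results)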

Does it work with complex expressions?

Modern vision-language models handle complex expressions including:

  • Negations: "The dog that is NOT wearing a collar"
  • Relationships: "The cup next to the laptop"
  • Multiple attributes: "The large red car parked behind the building"
  • Conditions: "The damaged box on the bottom shelf"

The more context you provide, the more precisely the model can identify the target.

Can I combine phrase grounding with other tasks?

Absolutely! You can create multiple datasets with different task types and use them together in your application workflow.

For example:

  1. Use phrase grounding to locate specific products
  2. Use VQA to answer questions about those products
  3. Use classification to categorize them

This creates powerful, flexible vision systems that go beyond single-task models. Train separate models for each task and combine them in your application.

Learn about VQA →
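
A rough sketch of chaining steps 1 and 2 from the workflow above: take a box produced by phrase grounding, crop it, and ask an open-source VQA model (BLIP, via Hugging Face transformers) about the region. The file name and box values are placeholders:

    import torch
    from PIL import Image
    from transformers import BlipForQuestionAnswering, BlipProcessor

    # Step 1 (phrase grounding) is assumed to have already produced this box,
    # e.g. for "the product on the top shelf"; the values are placeholders.
    image = Image.open("shelf.jpg").convert("RGB")
    box = (412, 128, 596, 240)                       # x_min, y_min, x_max, y_max

    # Step 2: crop the grounded region and ask a VQA model about it.
    crop = image.crop(box)

    vqa_processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
    vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

    inputs = vqa_processor(crop, "What color is the label?", return_tensors="pt")
    with torch.no_grad():
        answer_ids = vqa_model.generate(**inputs)

    print(vqa_processor.decode(answer_ids[0], skip_special_tokens=True))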


Related resources