Phrase Grounding
Learn how phrase grounding localizes objects in images using natural language descriptions instead of pre-defined categories.
Phrase grounding is a computer vision task that localizes specific objects or regions in an image based on natural language descriptions. Given an image and a text phrase, the system identifies and returns the spatial location of the referenced object as a bounding box.
Instead of training a model to detect pre-defined categories like "car" or "person," you describe what you want ("the red car on the left" or "the person wearing a blue jacket") and the model finds it.
Datature Vi lets you train a custom vision-language model (VLM) for phrase grounding on your own images. Learn what Datature Vi does or follow the quickstart.
Good fit for:
- Flexible object detection where categories vary or aren't pre-defined
- Describing objects by attributes, relationships, or position
- Interactive systems that respond to user-provided descriptions
- Robotics, image editing, autonomous vehicles, and warehouse automation
Not a good fit for:
- High-volume detection of a fixed, known set of object classes (use traditional object detection instead)
- Tasks where bounding boxes aren't needed (use visual question answering instead)
- Real-time detection with strict latency requirements
How phrase grounding works
A VLM processes the image and your text description jointly, in four stages:
Image encoding
A visual encoder (such as a ResNet or a Vision Transformer) extracts features at multiple scales, from low-level edges and textures up to object types and whole scenes.
Text encoding
A language model converts your description into embeddings that capture meaning, word relationships, and attributes like color, size, and position.
Cross-modal fusion
Attention mechanisms align text descriptions with image regions, learning which words match which visual areas.
Localization
The model predicts a bounding box around the target object. Datature Vi's VLMs handle complex expressions with negations, relationships, and multiple attributes.
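The four stages above can be sketched with a toy example. This is a minimal illustration, not how Datature Vi's VLMs are implemented: the region features, the 4-dimensional word vectors, and the function names `encode_image`, `encode_text`, `fuse`, and `localize` are all hypothetical stand-ins. A real model learns its features and typically predicts box coordinates directly, whereas this sketch only picks the best box from a fixed list of candidates.

```python
import math

# Hypothetical toy data. Each candidate region has a hand-made feature vector
# and a box in (x_min, y_min, x_max, y_max) pixel coordinates.
# Assumed feature dimensions: [red, blue, mug-like, box-like]
REGIONS = [
    {"box": (40, 120, 110, 200), "features": [0.9, 0.0, 0.8, 0.1]},   # red mug
    {"box": (300, 130, 370, 210), "features": [0.0, 0.9, 0.8, 0.1]},  # blue mug
    {"box": (180, 60, 280, 220), "features": [0.7, 0.0, 0.1, 0.9]},   # red box
]

# Hypothetical word embeddings that live in the same 4-d space as the regions.
WORD_VECTORS = {
    "red": [1.0, 0.0, 0.0, 0.0],
    "blue": [0.0, 1.0, 0.0, 0.0],
    "mug": [0.0, 0.0, 1.0, 0.0],
    "box": [0.0, 0.0, 0.0, 1.0],
}


def encode_image(regions):
    """Stage 1 stand-in: one feature vector per candidate region."""
    return [region["features"] for region in regions]


def encode_text(phrase):
    """Stage 2 stand-in: one vector per known word; unknown words are skipped."""
    return [WORD_VECTORS[w] for w in phrase.lower().split() if w in WORD_VECTORS]


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(word_vecs, region_vecs):
    """Stage 3 stand-in: each word attends over all regions (dot product plus
    softmax); the per-word attention weights are then averaged per region."""
    rows = []
    for word in word_vecs:
        scores = [sum(a * b for a, b in zip(word, region)) for region in region_vecs]
        rows.append(softmax(scores))
    return [sum(row[i] for row in rows) / len(rows) for i in range(len(region_vecs))]


def localize(phrase, regions):
    """Stage 4 stand-in: return the box of the most relevant region."""
    word_vecs = encode_text(phrase)
    if not word_vecs:
        raise ValueError("phrase contains no known words")
    relevance = fuse(word_vecs, encode_image(regions))
    best = max(range(len(regions)), key=relevance.__getitem__)
    return regions[best]["box"], relevance


box, relevance = localize("the red mug", REGIONS)
print(box)  # (40, 120, 110, 200)
```

Note that "red" alone also matches the red box and "mug" alone also matches the blue mug. Only after the per-word attention is combined does the red mug stand out, which is the role the fusion stage plays in the description above.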
Common use cases
Robotics
Direct robots with natural language: "Pick up the red mug on the left" or "Move the largest box to the shelf."
Augmented reality
Overlay information on objects identified by description: "Show specs for the machine in the center."
Image editing
Select regions with text instead of manual tools: "Mask the car in the background."
Autonomous vehicles
Localize traffic elements by description: "The stop sign partially hidden by the tree."
Medical imaging
Connect radiology descriptions to image regions: "The lesion in the upper right lobe."
Retail and inventory
Find products by description: "The damaged package in the corner" or "All products with red labels."
Phrase grounding vs. other vision tasks
Use phrase grounding to locate specific objects. Use visual question answering to understand or analyze image content.
Phrase grounding vs. traditional object detection
Traditional object detection finds instances of the classes you defined at training time, such as "car" or "person." It works well for stable, well-defined categories and for high-volume detection of known objects.
Phrase grounding localizes objects from free-form text descriptions, with no pre-defined categories. It suits requirements that change over time, objects best described by attributes or relationships, and interactive or conversational systems.
Example: Traditional detection finds every "car" in the image. Phrase grounding finds "the scratched car on the left conveyor belt."
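The difference in interface can be sketched as follows. This is a toy contrast only: the `SceneObject` records, the attribute tags, and the keyword-overlap scoring inside `ground` are hypothetical stand-ins for what a trained model would infer from pixels. The point is the shape of each call: a closed set of class names versus free-form text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    label: str             # class name a detector would assign
    attributes: frozenset  # hypothetical attribute tags, assumed known here
    box: tuple             # (x_min, y_min, x_max, y_max) in pixels


SCENE = [
    SceneObject("car", frozenset({"scratched", "left"}), (20, 80, 180, 200)),
    SceneObject("car", frozenset({"clean", "right"}), (320, 90, 480, 210)),
    SceneObject("person", frozenset({"left"}), (200, 40, 250, 200)),
]

# A traditional detector only knows the classes it was trained on.
KNOWN_CLASSES = {"car", "person"}


def detect(scene, class_name):
    """Closed-set interface: return every box of one pre-defined class."""
    if class_name not in KNOWN_CLASSES:
        raise ValueError(f"unknown class: {class_name!r}")
    return [obj.box for obj in scene if obj.label == class_name]


def ground(scene, phrase):
    """Open-text interface: return the single box that best matches the phrase.
    Keyword overlap is a toy substitute for learned cross-modal matching."""
    words = {w.strip(".,!?") for w in phrase.lower().split()}

    def score(obj):
        return int(obj.label in words) + len(obj.attributes & words)

    return max(scene, key=score).box


print(detect(SCENE, "car"))
# [(20, 80, 180, 200), (320, 90, 480, 210)]
print(ground(SCENE, "the scratched car on the left conveyor belt"))
# (20, 80, 180, 200)
```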
Tips for better phrase grounding results
Provide diverse training examples. Vary object types, phrasings, attributes, scene complexity, lighting, and viewpoints. The more variety the model sees during training, the better it tends to generalize to new images and descriptions. See dataset preparation and annotation best practices.
Frequently asked questions
How phrase grounding works in Datature Vi
The workflow for phrase grounding in Datature Vi follows four steps: create a phrase grounding dataset, upload images, annotate with text-linked bounding boxes, and train.
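To make "text-linked bounding boxes" concrete, here is one way such an annotation could be represented and sanity-checked before training. This is an assumed structure for illustration, not Datature Vi's actual annotation format: the class names `PhraseBox` and `GroundingAnnotation`, the helper `link_phrase`, the file name, and the character-offset convention are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PhraseBox:
    start: int  # character offset of the phrase in the caption (inclusive)
    end: int    # character offset where the phrase ends (exclusive)
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class GroundingAnnotation:
    image_path: str
    image_size: tuple  # (width, height) in pixels
    caption: str
    links: list = field(default_factory=list)

    def phrase(self, link):
        """Return the caption text that a link points at."""
        return self.caption[link.start:link.end]

    def validate(self):
        """Return a list of problems; an empty list means the record is usable."""
        problems = []
        width, height = self.image_size
        for i, link in enumerate(self.links):
            if not (0 <= link.start < link.end <= len(self.caption)):
                problems.append(f"link {i}: span is outside the caption")
            x0, y0, x1, y1 = link.box
            if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
                problems.append(f"link {i}: box is outside the image")
        return problems


def link_phrase(annotation, phrase, box):
    """Attach a box to the first occurrence of `phrase` in the caption.
    Raises ValueError if the phrase does not appear in the caption."""
    start = annotation.caption.index(phrase)
    annotation.links.append(PhraseBox(start, start + len(phrase), box))


annotation = GroundingAnnotation(
    image_path="shelf_001.jpg",  # hypothetical file name
    image_size=(640, 480),
    caption="A red mug next to the blue mug on the shelf",
)
link_phrase(annotation, "A red mug", (40, 120, 110, 200))
link_phrase(annotation, "the blue mug", (300, 130, 370, 210))

print(annotation.validate())  # []
```

Storing character offsets rather than copies of the phrase keeps each box tied to an exact span of the caption, so two mentions of "mug" in one sentence stay distinguishable.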
For more on how phrase grounding connects to other Datature Vi concepts, see the Visual Grounding glossary entry.
