Robotics and Physical AI

Use Datature Vi to ground language instructions in visual scenes for robot manipulation, pick-and-place tasks, and Vision-Language-Action workflows.

[Image: yellow bin containing mixed products for robotic pick-and-place sorting]

Robots work best when they can understand plain-language instructions: "pick up the red mug on the left shelf" or "move the blue package to the staging area." The challenge is connecting those words to what the robot's camera actually sees. Which object is the "red mug"? Where exactly is "the left shelf"?

Datature Vi trains models that link language to images. You show the model examples of objects in your workspace, label them with natural descriptions, and the model learns to find those objects in new camera frames. The robot's control system then uses that location information to plan its movements.

This applies to warehouse pick-and-place, manufacturing assembly, bin picking, and any task where a robot needs to understand what it is looking at before acting on it.

For an interactive overview of this application, visit the robotic pick and place use case on vi.datature.com.


Common applications

| Task | What the model does |
| --- | --- |
| Object grounding | Returns a bounding box for "the red cylinder on the second shelf" |
| Scene QA | Answers "Is the gripper path clear?" or "Which bin is empty?" |
| Task planning support | Describes the current scene state for a planning system |
| Error detection | Identifies when an object is misplaced or a task has failed |
| Human-robot instruction | Interprets natural language commands into scene-grounded actions |

Object grounding for manipulation

The most direct use case: given a language instruction, locate the target object in the image.

Task type: Phrase Grounding

Phrase grounding returns bounding box coordinates for objects described in natural language, which a robot arm's motion planner can convert into pick targets.

Annotation examples:

| Image | Caption phrase | Bounding box |
| --- | --- | --- |
| Shelf with mixed objects | the red cylindrical container | [312, 180, 445, 390] |
| Bin with parts | the smallest gear on the left | [85, 220, 160, 280] |
| Table scene | the blue screwdriver near the edge | [670, 410, 780, 510] |

At inference:

```python
from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    secret_key=".your-secret-key.",
    organization_id="your-organization-id",
)

result, error = model(
    source="workspace_camera.jpg",
    user_prompt="Pick up the red cylindrical container on the second shelf."
)

if error is None:
    for grounded in result.result.groundings:
        print(f"Object: {grounded.phrase}")
        for bbox in grounded.grounding:
            x_min, y_min, x_max, y_max = bbox
            # Normalize from Vi's [0, 1024] range to [0, 1] before
            # handing the target location to the motion planner
            center_x = (x_min + x_max) / 2 / 1024
            center_y = (y_min + y_max) / 2 / 1024
            print(f"Normalized center: ({center_x:.3f}, {center_y:.3f})")
```
Bounding box coordinate format

Vi returns bounding boxes in normalized coordinates in the range [0, 1024]. Scale by your image dimensions to get pixel coordinates. See Prediction Schemas for conversion code.
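As a concrete sketch of that conversion, the helper below scales a box from the [0, 1024] range to pixel coordinates for a given camera resolution (`to_pixels` is an illustrative name, not part of the SDK; see Prediction Schemas for the official conversion code):

```python
def to_pixels(bbox, image_width, image_height):
    """Scale a Vi bounding box from the normalized [0, 1024] range
    to pixel coordinates for a given image size."""
    x_min, y_min, x_max, y_max = bbox
    return (
        round(x_min / 1024 * image_width),
        round(y_min / 1024 * image_height),
        round(x_max / 1024 * image_width),
        round(y_max / 1024 * image_height),
    )

# A 1920x1080 camera frame and a box from the annotation examples above
print(to_pixels([312, 180, 445, 390], 1920, 1080))  # (585, 190, 834, 411)
```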


Scene understanding and QA

Use VQA to answer questions about the current state of the workspace, feeding answers into a planning or control system.

Task type: VQA

| Image | Question | Answer |
| --- | --- | --- |
| Clear path | Is the path between the robot and the staging area clear? | Yes, the path is clear with no obstacles. |
| Blocked path | Is the path between the robot and the staging area clear? | No, there is a cardboard box blocking the center of the path. |
| Empty bin | Which bin is empty and ready for loading? | The bin on the far right is empty. |

Structured scene state

For tighter integration with planning systems, use structured data extraction to return scene state as JSON:

```json
{
  "path_clear": false,
  "obstacle_type": "cardboard box",
  "obstacle_location": "center path",
  "recommended_action": "remove obstacle before proceeding"
}
```

Vision-Language-Action (VLA) workflows

In VLA systems, Vi acts as the perception and language grounding layer. The typical integration pattern:

```
Camera image → Vi VLM → Scene understanding / object location
                                  ↓
                       Task planner or policy network
                                  ↓
                          Robot action execution
```

Vi's role: Translate natural language instructions and camera images into structured outputs (bounding boxes, scene descriptions, state JSON) that your action model or planner consumes.

What Vi does not do: Vi does not directly output robot joint angles or action primitives. It provides the perception and language grounding; your downstream system handles action selection.
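The division of labor above can be sketched as a minimal control loop, with stub functions standing in for the Vi grounding call and your planner (`locate_target` and `plan_grasp` are hypothetical names; replace their bodies with your own integration code):

```python
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]

def locate_target(frame: str, instruction: str) -> Optional[BBox]:
    """Stand-in for a Vi phrase-grounding call: return the target's
    bounding box in the normalized [0, 1024] range, or None."""
    return (312.0, 180.0, 445.0, 390.0)  # stubbed detection

def plan_grasp(bbox: BBox) -> dict:
    """Stand-in for a motion planner: turn the grounded box into an
    action request. A real planner would output joint targets."""
    x_min, y_min, x_max, y_max = bbox
    return {
        "action": "grasp",
        "target_center": ((x_min + x_max) / 2, (y_min + y_max) / 2),
    }

bbox = locate_target("workspace_camera.jpg", "pick up the red container")
if bbox is not None:
    print(plan_grasp(bbox))  # hand off to robot action execution
else:
    print("Target not found; rescan or ask for clarification")
```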


Multi-step task reasoning

For tasks that require checking multiple conditions before acting, use chain-of-thought reasoning:

Example: pre-grasp scene check

System prompt:

```
You are a robot vision assistant. Before confirming a grasp action, check:
1. Is the target object visible and unobstructed?
2. Is there sufficient clearance around the object for the gripper?
3. Is the object orientation suitable for grasping?

Reason through each step, then give a final recommendation.
```

The model's <think> output (accessible via result.thinking) shows the reasoning chain; the final answer gives a clear go/no-go.
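If your integration receives the raw model text rather than the parsed `result.thinking` field, a small parser can split the reasoning chain from the final go/no-go (this helper is illustrative; the SDK already exposes the parsed form):

```python
import re

def split_reasoning(raw_output: str) -> tuple[str, str]:
    """Separate the <think>...</think> reasoning chain from the
    final answer in a chain-of-thought response."""
    match = re.search(r"<think>(.*?)</think>", raw_output, re.DOTALL)
    thinking = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", raw_output, flags=re.DOTALL).strip()
    return thinking, answer

raw = "<think>Object visible. Clearance OK. Orientation upright.</think> GO"
thinking, answer = split_reasoning(raw)
print(answer)  # GO
```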


Training tips for robotics

Match your camera setup exactly: train on images from the same camera, mounting position, and lighting conditions used in deployment. Perspective and lighting changes degrade localization accuracy significantly.

Capture object variation: the same object appears differently based on orientation, occlusion, and distance. Include multiple views of each target object in your training data.

Use precise spatial language: annotations should use consistent spatial terms that match how your system issues commands: "left", "right", "near edge", "top shelf". Avoid ambiguous phrasing.

Test at inference speed: for real-time applications, measure end-to-end latency (camera capture → model inference → output) with your deployed hardware before integrating into a control loop.

Start with static scenes: train and validate in static environments before moving to dynamic or cluttered workspaces. Add complexity incrementally.
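A rough way to measure the end-to-end latency mentioned above is to time the full capture-to-output path over repeated iterations (`capture_frame` and `run_inference` are placeholders for your camera driver and the Vi model call):

```python
import time
import statistics

def capture_frame() -> bytes:
    return b"fake-frame"  # placeholder for your camera driver

def run_inference(frame: bytes) -> str:
    time.sleep(0.01)  # placeholder for the Vi model call
    return "result"

# Time the full capture -> inference path, not just the model call
latencies = []
for _ in range(20):
    start = time.perf_counter()
    run_inference(capture_frame())
    latencies.append(time.perf_counter() - start)

print(f"median: {statistics.median(latencies) * 1000:.1f} ms, "
      f"worst: {max(latencies) * 1000:.1f} ms")
```

Report the median and worst case rather than the mean; control loops care about tail latency.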


Next steps

Phrase Grounding

Full reference for object localization with natural language, bounding box format, and annotation best practices.

Chain-of-Thought Reasoning

Add multi-step reasoning for complex scene assessment and task planning support.

Structured Data Extraction

Return scene state as JSON for direct integration with planning and control systems.