Robotics and Physical AI

Use Datature Vi to ground language instructions in visual scenes for robot manipulation, pick-and-place tasks, and Vision-Language-Action workflows.

[Image: yellow bin containing mixed products for robotic pick-and-place sorting]

Robots work best when they can understand plain-language instructions: "pick up the red mug on the left shelf" or "move the blue package to the staging area." The challenge is connecting those words to what the robot's camera actually sees. Which object is the "red mug"? Where exactly is "the left shelf"?

Datature Vi trains models that link language to images. You show the model examples of objects in your workspace, label them with natural descriptions, and the model learns to find those objects in new camera frames. The robot's control system then uses that location information to plan its movements.

This applies to warehouse pick-and-place, manufacturing assembly, bin picking, and any task where a robot needs to understand what it is looking at before acting on it.

For an interactive overview of this application, visit the robotic pick and place use case on vi.datature.com.


Common applications

| Task | What the model does |
| --- | --- |
| Object grounding | Returns a bounding box for "the red cylinder on the second shelf" |
| Scene QA | Answers "Is the gripper path clear?" or "Which bin is empty?" |
| Task planning support | Describes the current scene state for a planning system |
| Error detection | Identifies when an object is misplaced or a task has failed |
| Human-robot instruction | Interprets natural language commands into scene-grounded actions |

Object grounding for manipulation

The most direct use case: given a language instruction, locate the target object in the image.

Task type: Phrase Grounding

Phrase grounding returns bounding box coordinates for objects described in natural language, which a robot arm's motion planner can convert into pick targets.

Annotation examples:

| Image | Caption phrase | Bounding box |
| --- | --- | --- |
| Shelf with mixed objects | the red cylindrical container | [312, 180, 445, 390] |
| Bin with parts | the smallest gear on the left | [85, 220, 160, 280] |
| Table scene | the blue screwdriver near the edge | [670, 410, 780, 510] |

At inference:

```python
from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    secret_key=".your-secret-key.",
    organization_id="your-organization-id",
)

result, error = model(
    source="workspace_camera.jpg",
    user_prompt="Pick up the red cylindrical container on the second shelf."
)

if error is None:
    for grounded in result.result.groundings:
        print(f"Object: {grounded.phrase}")
        for bbox in grounded.grounding:
            x_min, y_min, x_max, y_max = bbox
            # Normalize from Vi's [0, 1024] range to [0, 1] before
            # handing the target location to the motion planner
            center_x = (x_min + x_max) / 2 / 1024
            center_y = (y_min + y_max) / 2 / 1024
            print(f"Normalized center: ({center_x:.3f}, {center_y:.3f})")
```
Bounding box coordinate format

Vi returns bounding boxes in normalized coordinates in the range [0, 1024]. Scale by your image dimensions to get pixel coordinates. See Prediction Schemas for conversion code.
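As a concrete sketch of that conversion, the helper below scales a box from the [0, 1024] range to pixel coordinates for a given camera resolution (`to_pixels` is an illustrative name, not part of the SDK; see Prediction Schemas for the official conversion code):

```python
def to_pixels(bbox, image_width, image_height):
    """Scale a Vi bounding box from the normalized [0, 1024] range
    to pixel coordinates for a given image size."""
    x_min, y_min, x_max, y_max = bbox
    return (
        round(x_min / 1024 * image_width),
        round(y_min / 1024 * image_height),
        round(x_max / 1024 * image_width),
        round(y_max / 1024 * image_height),
    )

# A 1920x1080 camera frame and a box from the annotation examples above
print(to_pixels([312, 180, 445, 390], 1920, 1080))  # (585, 190, 834, 411)
```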


Scene understanding and QA

Use VQA to answer questions about the current state of the workspace, feeding answers into a planning or control system.

Task type: VQA

| Image | Question | Answer |
| --- | --- | --- |
| Clear path | Is the path between the robot and the staging area clear? | Yes, the path is clear with no obstacles. |
| Blocked path | Is the path between the robot and the staging area clear? | No, there is a cardboard box blocking the center of the path. |
| Empty bin | Which bin is empty and ready for loading? | The bin on the far right is empty. |

Structured scene state

For tighter integration with planning systems, use structured data extraction to return scene state as JSON:

```json
{
  "path_clear": false,
  "obstacle_type": "cardboard box",
  "obstacle_location": "center path",
  "recommended_action": "remove obstacle before proceeding"
}
```

Vision-Language-Action (VLA) workflows

In VLA systems, Vi acts as the perception and language grounding layer. The typical integration pattern:

```
Camera image → Vi VLM → Scene understanding / object location
                                  ↓
                       Task planner or policy network
                                  ↓
                          Robot action execution
```

Vi's role: Translate natural language instructions and camera images into structured outputs (bounding boxes, scene descriptions, state JSON) that your action model or planner consumes.

What Vi does not do: Vi does not directly output robot joint angles or action primitives. It provides the perception and language grounding; your downstream system handles action selection.
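The division of labor above can be sketched as a minimal control loop, with stub functions standing in for the Vi grounding call and your planner (`locate_target` and `plan_grasp` are hypothetical names; replace their bodies with your own integration code):

```python
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]

def locate_target(frame: str, instruction: str) -> Optional[BBox]:
    """Stand-in for a Vi phrase-grounding call: return the target's
    bounding box in the normalized [0, 1024] range, or None."""
    return (312.0, 180.0, 445.0, 390.0)  # stubbed detection

def plan_grasp(bbox: BBox) -> dict:
    """Stand-in for a motion planner: turn the grounded box into an
    action request. A real planner would output joint targets."""
    x_min, y_min, x_max, y_max = bbox
    return {
        "action": "grasp",
        "target_center": ((x_min + x_max) / 2, (y_min + y_max) / 2),
    }

bbox = locate_target("workspace_camera.jpg", "pick up the red container")
if bbox is not None:
    print(plan_grasp(bbox))  # hand off to robot action execution
else:
    print("Target not found; rescan or ask for clarification")
```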


Multi-step task reasoning

For tasks that require checking multiple conditions before acting, use chain-of-thought reasoning:

Example: pre-grasp scene check

System prompt:

```
You are a robot vision assistant. Before confirming a grasp action, check:
1. Is the target object visible and unobstructed?
2. Is there sufficient clearance around the object for the gripper?
3. Is the object orientation suitable for grasping?

Reason through each step, then give a final recommendation.
```

The model's <think> output (accessible via result.thinking) shows the reasoning chain; the final answer gives a clear go/no-go.
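If your integration receives the raw model text rather than the parsed `result.thinking` field, a small parser can split the reasoning chain from the final go/no-go (this helper is illustrative; the SDK already exposes the parsed form):

```python
import re

def split_reasoning(raw_output: str) -> tuple[str, str]:
    """Separate the <think>...</think> reasoning chain from the
    final answer in a chain-of-thought response."""
    match = re.search(r"<think>(.*?)</think>", raw_output, re.DOTALL)
    thinking = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", raw_output, flags=re.DOTALL).strip()
    return thinking, answer

raw = "<think>Object visible. Clearance OK. Orientation upright.</think> GO"
thinking, answer = split_reasoning(raw)
print(answer)  # GO
```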


Training tips for robotics

Match your camera setup exactly: train on images from the same camera, mounting position, and lighting conditions used in deployment. Perspective and lighting changes degrade localization accuracy significantly.

Capture object variation: the same object appears differently based on orientation, occlusion, and distance. Include multiple views of each target object in your training data.

Use precise spatial language: annotations should use consistent spatial terms that match how your system issues commands: "left", "right", "near edge", "top shelf". Avoid ambiguous phrasing.

Test at inference speed: for real-time applications, measure end-to-end latency (camera capture → model inference → output) with your deployed hardware before integrating into a control loop.

Start with static scenes: train and validate in static environments before moving to dynamic or cluttered workspaces. Add complexity incrementally.
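A rough way to measure the end-to-end latency mentioned above is to time the full capture-to-output path over repeated iterations (`capture_frame` and `run_inference` are placeholders for your camera driver and the Vi model call):

```python
import time
import statistics

def capture_frame() -> bytes:
    return b"fake-frame"  # placeholder for your camera driver

def run_inference(frame: bytes) -> str:
    time.sleep(0.01)  # placeholder for the Vi model call
    return "result"

# Time the full capture -> inference path, not just the model call
latencies = []
for _ in range(20):
    start = time.perf_counter()
    run_inference(capture_frame())
    latencies.append(time.perf_counter() - start)

print(f"median: {statistics.median(latencies) * 1000:.1f} ms, "
      f"worst: {max(latencies) * 1000:.1f} ms")
```

Report the median and worst case rather than the mean; control loops care about tail latency.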


Next steps

Phrase Grounding

Full reference for object localization with natural language, bounding box format, and annotation best practices.

Chain-of-Thought Reasoning

Add multi-step reasoning for complex scene assessment and task planning support.

Structured Data Extraction

Return scene state as JSON for direct integration with planning and control systems.